A small preface


"Originally, this work has been prepared in the framework of a seminar of the 
University of Bonn in Germany, but it has been and will be extended (after 
being presented and published online under www.dkriesel.com on 
5/27/2005). First and foremost, to provide a comprehensive overview of the 
subject of neural networks and, second, just to acquire more and more 
knowledge about LaTeX. And who knows - maybe one day this summary will
become a real preface!"

Abstract of this work, end of 2005 


The above abstract has not yet become a 
preface but at least a little preface, ever 
since the extended text (then 40 pages 
long) has turned out to be a download 
hit. 

Ambition and intention of this 
manuscript 

The entire text is written and laid out more effectively and with more
illustrations than before. I did all the illustrations myself, most of
them directly in LaTeX by using XYpic. They reflect what I would have
liked to see when becoming acquainted with the subject: Text and
illustrations should be memorable and easy to understand to offer as many
people as possible access to the field of neural networks.

Nevertheless, the mathematically and formally skilled readers will be able
to understand the definitions without reading the running text, while the
opposite holds for readers only interested in the subject matter;
everything is explained in both colloquial and formal language. Please let
me know if you find out that I have violated this principle.

The sections of this text are mostly 
independent from each other 

The document itself is divided into different parts, which are again
divided into chapters. Although the chapters contain cross-references,
they are also individually accessible to readers with little previous
knowledge. There are larger and smaller chapters: While the larger
chapters should provide profound insight into a paradigm of neural
networks (e.g. the classic neural network structure: the perceptron and
its learning procedures), the smaller chapters give a short overview - but
this is also explained in the introduction of each chapter. In addition to
all the definitions and explanations I have included some excursuses to
provide interesting information not directly related to the subject.

Unfortunately, I was not able to find free German sources that are
multi-faceted in respect of content (concerning the paradigms of neural
networks) and, nevertheless, written in coherent style. The aim of this
work is (even if it could not be fulfilled at first go) to close this gap
bit by bit and to provide easy access to the subject.

Want to learn not only by 
reading, but also by coding? 
Use SNIPE! 

SNIPE¹ is a well-documented Java library that implements a framework for
neural networks in a speedy, feature-rich and usable way. It is available
at no cost for non-commercial purposes. It was originally designed for
high performance simulations with lots and lots of neural networks (even
large ones) being trained simultaneously. Recently, I decided to give it
away as a professional reference implementation that covers network
aspects handled within this work, while at the same time being faster and
more efficient than lots of other implementations due to the original
high-performance simulation design goal. Those of you who are up for
learning by doing and/or have to use a fast and stable neural networks
implementation for some reason should definitely have a look at Snipe.

¹ Scalable and Generalized Neural Information Processing Engine,
  downloadable at http://www.dkriesel.com/tech/snipe, online JavaDoc at
  http://snipe.dkriesel.com

However, the aspects covered by Snipe are not entirely congruent with
those covered by this manuscript. Some of the kinds of neural networks are
not supported by Snipe, while when it comes to other kinds of neural
networks, Snipe may have lots and lots more capabilities than may ever be
covered in the manuscript in the form of practical hints. Anyway, in my
experience almost all of the implementation requirements of my readers are
covered well. On the Snipe download page, look for the section "Getting
started with Snipe" - you will find an easy step-by-step guide concerning
Snipe and its documentation, as well as some examples.

SNIPE: This manuscript frequently incorporates Snipe. Shaded
Snipe-paragraphs like this one are scattered among large parts of the
manuscript, providing information on how to implement their context in
Snipe. This also implies that those who do not want to use Snipe just have
to skip the shaded Snipe-paragraphs! The Snipe-paragraphs assume the
reader has had a close look at the "Getting started with Snipe" section.
Often, class names are used. As Snipe consists of only a few different
packages, I omitted the package names within the qualified class names for
the sake of readability.

It’s easy to print this manuscript

This text is completely illustrated in color, but it can also be printed
as is in monochrome: The colors of figures, tables and text are
well-chosen so that in addition to an appealing design the colors are
still easy to distinguish when printed in monochrome.


There are many tools directly 
integrated into the text 

Different aids are directly integrated in the 
document to make reading more flexible: 
However, anyone (like me) who prefers 
reading words on paper rather than on 
screen can also enjoy some features. 


In the table of contents, different 
types of chapters are marked 

Different types of chapters are directly marked within the table of
contents. Chapters that are marked as "fundamental" are definitely ones to
read because almost all subsequent chapters heavily depend on them. Other
chapters additionally depend on information given in other (preceding)
chapters, which then is marked in the table of contents, too.


Speaking headlines throughout the 
text, short ones in the table of 
contents 

The whole manuscript is now pervaded by such headlines. Speaking headlines
are not just title-like ("Reinforcement Learning"), but centralize the
information given in the associated section to a single sentence. In the
named instance, an appropriate headline would be "Reinforcement learning
methods provide feedback to the network, whether it behaves good or bad".
However, such long headlines would bloat the table of contents in an
unacceptable way. So I used short titles like the first one in the table
of contents, and speaking ones, like the latter, throughout the text.

Marginal notes are a navigational 
aid 

The entire document contains marginal notes in colloquial language (see
the example in the margin), allowing you to "scan" the document quickly to
find a certain passage in the text (including the titles).

New mathematical symbols are marked by 
specific marginal notes for easy finding 
(see the example for x in the margin). 

There are several kinds of indexing 

This document contains different types of indexing: If you have found a
word in the index and opened the corresponding page, you can easily find
it by searching for highlighted text - all indexed words are highlighted
like this.

Mathematical symbols appearing in several chapters of this document (e.g.
Ω for an output neuron; I tried to maintain a consistent nomenclature for
regularly recurring elements) are separately indexed under "Mathematical
Symbols", so they can easily be assigned to the corresponding term.

Names of persons written in small caps 
are indexed in the category "Persons" and 
ordered by the last names. 


Terms of use and license 

Beginning with the epsilon edition, the text is licensed under the
Creative Commons Attribution-No Derivative Works 3.0 Unported License²,
except for some little portions of the work licensed under more liberal
licenses as mentioned (mainly some figures from Wikimedia Commons).
A quick license summary:

1. You are free to redistribute this document (even though it is a much
   better idea to just distribute the URL of my homepage, for it always
   contains the most recent version of the text).

2. You may not modify, transform, or build upon the document except for
   personal use.

3. You must maintain the author’s attribution of the document at all
   times.

4. You may not use the attribution to imply that the author endorses you
   or your document use.

² http://creativecommons.org/licenses/by-nd/3.0/

Since I am no lawyer, the above bullet-point
summary is just informational: if there is
any conflict in interpretation between the 
summary and the actual license, the actual 
license always takes precedence. Note that 
this license does not extend to the source 
files used to produce the document. Those 
are still mine. 

How to cite this manuscript 

There’s no official publisher, so you need to be careful with your
citation. Please find more information in English and German on my
homepage, or rather on the subpage concerning the manuscript³.

Part I

From biology to formalization — 
motivation, philosophy, history and 
realization of neural models 




Chapter 1 

Introduction, motivation and history 


How to teach a computer? You can either write a fixed program - or you can 
enable the computer to learn on its own. Living beings do not have any 
programmer writing a program for developing their skills, which then only has 
to be executed. They learn by themselves - without the previous knowledge 
from external impressions - and thus can solve problems better than any 
computer today. What qualities are needed to achieve such a behavior for 
devices like computers? Can such cognition be adapted from biology? History, 
development, decline and resurgence of a wide approach to solve problems. 


1.1 Why neural networks? 

There are problem categories that cannot be formulated as an algorithm:
problems that depend on many subtle factors, for example the purchase
price of a piece of real estate, which our brain can (approximately)
calculate. Without an algorithm a computer cannot do the same. Therefore
the question to be asked is: How do we learn to explore such problems?

Exactly - we learn; a capability computers obviously do not have. Humans
have a brain that can learn. Computers have some processing units and
memory. They allow the computer to perform the most complex numerical
calculations in a very short time, but they are not adaptive.


If we compare computer and brain¹, we will note that, theoretically, the
computer should be more powerful than our brain: It comprises 10^9
transistors with a switching time of 10^-9 seconds. The brain contains
10^11 neurons, but these only have a switching time of about 10^-3
seconds.

The largest part of the brain is working continuously, while the largest
part of the computer is only passive data storage. Thus, the brain is
parallel and therefore performing close to its theoretical maximum, from
which the computer is orders of magnitude away (Table 1.1). Additionally,
a computer is static - the brain as a biological neural network can
reorganize itself during its "lifespan" and therefore is able to learn, to
compensate errors and so forth.

                                Brain                Computer
No. of processing units         ≈ 10^11              ≈ 10^9
Type of processing units        Neurons              Transistors
Type of calculation             massively parallel   usually serial
Data storage                    associative          address-based
Switching time                  ≈ 10^-3 s            ≈ 10^-9 s
Possible switching operations   ≈ 10^13 1/s          ≈ 10^18 1/s
Actual switching operations     ≈ 10^12 1/s          ≈ 10^10 1/s

Table 1.1: The (flawed) comparison between brain and computer at a glance.
Inspired by: [Zel94]

¹ Of course, this comparison is - for obvious reasons - controversially
  discussed by biologists and computer scientists, since response time and
  quantity do not tell anything about quality and performance of the
  processing units, and neurons and transistors cannot be compared
  directly. Nevertheless, the comparison serves its purpose and indicates
  the advantage of parallelism by means of processing time.


Within this text I want to outline how we can use the said characteristics
of our brain for a computer system.

So the study of artificial neural networks is motivated by their
similarity to successfully working biological systems, which - in
comparison to the overall system - consist of very simple but numerous
nerve cells that work massively in parallel and (which is probably one of
the most significant aspects) have the capability to learn. There is no
need to explicitly program a neural network. For instance, it can learn
from training samples or by means of encouragement - with a carrot and a
stick, so to speak (reinforcement learning).

One result from this learning procedure is the capability of neural
networks to generalize and associate data: After successful training a
neural network can find reasonable solutions for similar problems of the
same class that were not explicitly trained. This in turn results in a
high degree of fault tolerance against noisy input data.


Fault tolerance is closely related to biological neural networks, in which
this characteristic is very distinct: As previously mentioned, a human has
about 10^11 neurons that continuously reorganize themselves or are
reorganized by external influences (about 10^5 neurons can be destroyed
while in a drunken stupor, some types of food or environmental influences
can also destroy brain cells). Nevertheless, our cognitive abilities are
not significantly affected. Thus, the brain is tolerant against internal
errors - and also against external errors, for we can often read a really
"dreadful scrawl" although the individual letters are nearly impossible to
read.

Our modern technology, however, is not automatically fault-tolerant. I
have never heard that someone forgot to install the hard disk controller
into a computer and therefore the graphics card automatically took over
its tasks, i.e. removed conductors and developed communication, so that
the system as a whole was affected by the missing component, but not
completely destroyed.

A disadvantage of this distributed fault-tolerant storage is certainly the
fact that we cannot see at first sight what a neural network knows and
performs or where its faults lie. Usually, it is easier to perform such
analyses for conventional algorithms. Most often we can only transfer
knowledge into our neural network by means of a learning procedure, which
can cause several errors and is not always easy to manage.

Fault tolerance of data, on the other hand, is already more sophisticated
in state-of-the-art technology: Let us compare a record and a CD. If there
is a scratch on a record, the audio information on this spot will be
completely lost (you will hear a pop) and then the music goes on. On a CD
the audio data are stored in a distributed way: A scratch causes a blurry
sound in its vicinity, but the data stream remains largely unaffected. The
listener won’t notice anything.

So let us summarize the main characteristics we try to adapt from biology:

> Self-organization and learning capability,

> Generalization capability and

> Fault tolerance.

What types of neural networks particularly develop what kinds of abilities
and can be used for what problem classes will be discussed in the course
of this work.

In the introductory chapter I want to clarify the following: "The neural
network" does not exist. There are different paradigms for neural
networks, how they are trained and where they are used. My goal is to
introduce some of these paradigms and supplement some remarks for
practical application.

We have already mentioned that our brain works massively in parallel, in
contrast to the functioning of a computer, i.e. every component is active
at any time. If we want to state an argument for massive parallel
processing, then the 100-step rule can be cited.


1.1.1 The 100-step rule 

Experiments showed that a human can recognize the picture of a familiar
object or person in ≈ 0.1 seconds, which corresponds, at a neuron
switching time of ≈ 10^-3 seconds, to ≈ 100 discrete time steps of
parallel processing.

A computer following the von Neumann architecture, however, can do
practically nothing in 100 time steps of sequential processing, which are
100 assembler steps or cycle steps.
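
The arithmetic behind the 100-step rule can be retraced in a few lines.
The following Python sketch only uses the two figures quoted above; the
variable names are chosen here for illustration, and the transistor
switching time of 10^-9 seconds is borrowed from the brain/computer
comparison in section 1.1.

```python
recognition_time_s = 0.1         # time to recognize a familiar object (from the text)
neuron_switching_time_s = 1e-3   # switching time of a neuron (from the text)

# Number of strictly sequential neuron switching steps that fit into that time.
sequential_steps = round(recognition_time_s / neuron_switching_time_s)

# For comparison: the time a serial machine with a switching time of
# 1e-9 s (figure from section 1.1) would spend on the same number of steps.
transistor_switching_time_s = 1e-9
serial_time_s = sequential_steps * transistor_switching_time_s

print(sequential_steps)   # 100
print(serial_time_s)      # roughly 1e-07 seconds
```

Whatever the brain achieves within those 100 steps therefore has to be
spread over very many units working at the same time.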

Now we want to look at a simple application example for a neural network.

Figure 1.1: A small robot with eight sensors and two motors. The arrow
indicates the driving direction.


1.1.2 Simple application examples 

Let us assume that we have a small robot as shown in fig. 1.1. This robot
has eight distance sensors from which it extracts input data: Three
sensors are placed on the front right, three on the front left, and two on
the back. Each sensor provides a real numeric value at any time, that
means we are always receiving an input I ∈ ℝ^8.

Despite its two motors (which will be needed later) the robot in our
simple example is not capable of doing much: It shall only drive on but
stop when it might collide with an obstacle. Thus, our output is binary:
H = 0 for "Everything is okay, drive on" and H = 1 for "Stop" (the output
is called H for "halt signal"). Therefore we need a mapping

f : ℝ^8 → 𝔹^1,

that applies the input signals to a robot activity.


1.1.2.1 The classical way 

There are two ways of realizing this mapping. On the one hand, there is
the classical way. We sit down and think for a while, and finally the
result is a circuit or a small computer program which realizes the mapping
(this is easily possible, since the example is very simple). After that we
refer to the technical reference of the sensors, study their
characteristic curve in order to learn the values for the different
obstacle distances, and embed these values into the aforementioned set of
rules. Such procedures are applied in the classic artificial intelligence,
and if you know the exact rules of a mapping algorithm, you are always
well advised to follow this scheme.
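
As a rough illustration of the classical way, the mapping could be written
down by hand as follows. This is only a sketch under assumptions that are
not part of the text: the sensor values are taken to be distances in
metres, the function name halt_signal and the threshold STOP_DISTANCE are
hypothetical, and a real value would have to be read from the sensors'
characteristic curve.

```python
from typing import Sequence

# Hypothetical threshold; in practice it would come from the sensor data sheet.
STOP_DISTANCE = 0.2  # metres (assumed unit)


def halt_signal(distances: Sequence[float],
                stop_distance: float = STOP_DISTANCE) -> int:
    """Hand-written mapping from eight sensor values to the binary output H.

    Returns 1 ("Stop") if any sensor reports an obstacle closer than
    stop_distance, otherwise 0 ("Everything is okay, drive on").
    """
    if len(distances) != 8:
        raise ValueError("expected eight sensor values")
    return 1 if min(distances) < stop_distance else 0


free_path = [1.0] * 8
obstacle_ahead = [1.0] * 7 + [0.1]
print(halt_signal(free_path), halt_signal(obstacle_ahead))
```

The rule is fixed once and for all by the programmer; nothing in it adapts
to situations that were not anticipated when it was written.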


1.1.2.2 The way of learning 


On the other hand, more interesting and more successful for many mappings
and problems that are hard to comprehend straightaway is the way of
learning: We show different possible situations to the robot (fig. 1.2),
and the robot shall learn on its own what to do in the course of its robot
life.

In this example the robot shall simply learn when to stop. We first treat
the neural network as a kind of black box (fig. 1.3). This means we do not
know its structure but just regard its behavior in practice.

Figure 1.3: Initially, we regard the robot control as a black box whose
inner life is unknown. The black box receives eight real sensor values and
maps these values to a binary output value.

The situations in form of simply measured sensor values (e.g. placing the
robot in front of an obstacle, see illustration), which we show to the
robot and for which we specify whether to drive on or to stop, are called
training samples. Thus, a training sample consists of an exemplary input
and a corresponding desired output. Now the question is how to transfer
this knowledge, the information, into the neural network.

The samples can be taught to a neural network by using a simple learning
procedure (a learning procedure is a simple algorithm or a mathematical
formula). If we have done everything right and chosen good samples, the
neural network will generalize from these samples and find a universal
rule when it has to stop.
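
To make the idea of training samples concrete, here is a small Python
sketch. The text has not yet introduced any particular learning procedure,
so the sketch picks one as an assumption: a single threshold unit trained
with a perceptron-style update. The sample values (1.0 for a free path,
0.1 for a nearby obstacle), the function names and the learning rate are
all hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (eight sensor values, desired output H)


def predict(weights: Sequence[float], bias: float,
            inputs: Sequence[float]) -> int:
    """Threshold unit: H = 1 if the weighted sum plus bias is positive."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train(samples: Sequence[Sample], learning_rate: float = 1.0,
          max_epochs: int = 10_000) -> Tuple[List[float], float]:
    """Show the samples repeatedly and correct the weights after each error."""
    weights = [0.0] * 8
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, desired in samples:
            error = desired - predict(weights, bias, inputs)
            if error != 0:
                errors += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if errors == 0:  # every sample is reproduced correctly
            break
    return weights, bias


def blocked(sensor: int) -> List[float]:
    """Sensor readings with one sensor close to an obstacle (assumed values)."""
    values = [1.0] * 8
    values[sensor] = 0.1
    return values


# One "drive on" sample and eight "stop" samples, one per sensor.
training_samples: List[Sample] = [([1.0] * 8, 0)]
training_samples += [(blocked(i), 1) for i in range(8)]

weights, bias = train(training_samples)

# Situations that were never shown during training.
unseen_obstacle = [0.05, 0.05, 0.05, 1.0, 1.0, 1.0, 1.0, 1.0]
unseen_free = [1.2] * 8
print(predict(weights, bias, unseen_obstacle), predict(weights, bias, unseen_free))
```

The two unseen inputs at the end are the interesting part: they test
whether the unit has picked up a rule rather than memorized the nine
samples it was shown.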


Our example can be optionally expanded. For the purpose of direction
control it would be possible to control the motors of our robot
separately², with the sensor layout being the same. In this case we are
looking for a mapping

f : ℝ^8 → ℝ^2,

which gradually controls the two motors by means of the sensor inputs and
thus can not only, for example, stop the robot but also let it avoid
obstacles. Here it is more difficult to analytically derive the rules, and
de facto a neural network would be more appropriate.

Our goal is not to learn the samples by heart, but to realize the
principle behind them: Ideally, the robot should apply the neural network
in any situation and be able to avoid obstacles. In particular, the robot
should query the network continuously and repeatedly while driving in
order to continuously avoid obstacles. The result is a constant cycle: The
robot queries the network. As a consequence, it will drive in one
direction, which changes the sensor values. Again the robot queries the
network and changes its position, the sensor values are changed once
again, and so on. It is obvious that this system can also be adapted to
dynamic, i.e. changing, environments (e.g. the moving obstacles in our
example).
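
The cycle of querying and driving described above can be sketched as a
simple loop. Everything in this Python sketch is an assumption made for
illustration: the one-dimensional toy world with a single wall, the step
width, and the hand-written threshold_controller that merely stands in for
a trained network producing the halt signal H.

```python
from typing import Callable, List


def threshold_controller(sensor_values: List[float]) -> int:
    """Stand-in for the network: H = 1 if any sensor reads below 0.2."""
    return 1 if min(sensor_values) < 0.2 else 0


def drive(controller: Callable[[List[float]], int], wall_distance: float,
          step: float = 0.03, max_steps: int = 1000) -> float:
    """Query the controller again and again while approaching a wall.

    Returns the remaining distance to the wall when the robot halts.
    """
    distance = wall_distance
    for _ in range(max_steps):
        # Toy world: all eight sensors see the same wall.
        sensor_values = [distance] * 8
        if controller(sensor_values) == 1:   # H = 1: stop
            break
        distance -= step                     # H = 0: drive on, readings change
    return distance


remaining = drive(threshold_controller, wall_distance=1.0)
print(remaining)
```

Each pass through the loop is one query; the robot's own movement is what
changes the input of the next query.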

² There is a robot called Khepera with more or less similar
  characteristics. It is round-shaped, approx. 7 cm in diameter, has two
  motors with wheels and various sensors. For more information I recommend
  to refer to the internet.

Figure 1.2: The robot is positioned in a landscape that provides sensor
values for different situations. We add the desired output values H and so
receive our learning samples. The directions in which the sensors are
oriented are exemplarily applied to two robots.


1.2 A brief history of neural 
networks 


The field of neural networks has, like any other field of science, a long
history of development with many ups and downs, as we will see soon. To
continue the style of my work I will not present this history in text form
but more compactly in the form of a timeline. Citations and
bibliographical references are added mainly for those topics that will not
be further discussed in this text. Citations for keywords that will be
explained later are mentioned in the corresponding chapters.

The history of neural networks begins in the early 1940’s and thus nearly
simultaneously with the history of programmable electronic computers. The
youth of this field of research, as with the field of computer science
itself, can be easily recognized due to the fact that many of the cited
persons are still with us.


1.2.1 The beginning 


As early as 1943 Warren McCulloch and Walter Pitts introduced models of
neurological networks, recreated threshold switches based on neurons and
showed that even simple networks of this kind are able to calculate nearly
any logic or arithmetic function [MP43]. Furthermore, the first computer
precursors ("electronic brains") were developed, among others supported by
Konrad Zuse, who was tired of calculating ballistic trajectories by hand.

Figure 1.4: Some institutions of the field of neural networks. From left
to right: John von Neumann, Donald O. Hebb, Marvin Minsky, Bernard Widrow,
Seymour Papert, Teuvo Kohonen, John Hopfield, "in the order of appearance"
as far as possible.

1947: Walter Pitts and Warren McCulloch indicated a practical field of
application (which was not mentioned in their work from 1943), namely the
recognition of spatial patterns by neural networks [PM47].

1949: Donald O. Hebb formulated the classical Hebbian rule [Heb49] which
represents in its more generalized form the basis of nearly all neural
learning procedures. The rule implies that the connection between two
neurons is strengthened when both neurons are active at the same time.
This change in strength is proportional to the product of the two
activities. Hebb could postulate this rule, but due to the absence of
neurological research he was not able to verify it.

1950: The neuropsychologist Karl Lashley defended the thesis that brain
information storage is realized as a distributed system. His thesis was
based on experiments on rats, where only the extent but not the location
of the destroyed nerve tissue influences the rats’ performance to find
their way out of a labyrinth.
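
The Hebbian rule from the 1949 entry can be written down in one line: the
change of a connection weight is proportional to the product of the two
activities. In the following Python sketch the proportionality constant
learning_rate and all names are assumptions made for illustration; the
entry itself only states the proportionality.

```python
def hebbian_update(weight: float, activity_a: float, activity_b: float,
                   learning_rate: float = 0.1) -> float:
    """Return the new connection weight after one Hebbian step.

    The change is learning_rate * activity_a * activity_b, i.e. it is
    proportional to the product of the two neurons' activities.
    """
    return weight + learning_rate * activity_a * activity_b


both_active = hebbian_update(0.5, 1.0, 1.0)   # connection is strengthened
one_silent = hebbian_update(0.5, 1.0, 0.0)    # connection stays unchanged
print(both_active, one_silent)
```

With one of the two neurons silent, the product is zero and the connection
keeps its strength.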

1.2.2 Golden age 

1951: For his dissertation Marvin Minsky developed the neurocomputer
Snark, which has already been capable of adjusting its weights³
automatically. But it has never been practically implemented, since it is
capable of busily calculating, but nobody really knows what it calculates.

1956: Well-known scientists and ambitious students met at the Dartmouth
Summer Research Project and discussed, to put it crudely, how to simulate
a brain. Differences between top-down and bottom-up research developed.
While the early supporters of artificial intelligence wanted to simulate
capabilities by means of software, supporters of neural networks wanted to
achieve system behavior by imitating the smallest parts of the system -
the neurons.

³ We will learn soon what weights are.

1957-1958: At MIT, Frank Rosenblatt, Charles Wightman and their coworkers
developed the first successful neurocomputer, the Mark I perceptron, which
was capable of recognizing simple numerals by means of a 20 × 20 pixel
image sensor and electromechanically worked with 512 motor-driven
potentiometers - each potentiometer representing one variable weight.

1959: Frank Rosenblatt described different versions of the perceptron,
formulated and verified his perceptron convergence theorem. He described
neuron layers mimicking the retina, threshold switches, and a learning
rule adjusting the connecting weights.

1960: Bernard Widrow and Marcian E. Hoff introduced the ADALINE (ADAptive
LInear NEuron) [WH60], a fast and precise adaptive learning system being
the first widely commercially used neural network: It could be found in
nearly every analog telephone for real-time adaptive echo filtering and
was trained by means of the Widrow-Hoff rule or delta rule. At that time
Hoff, later one of the first employees of Intel Corporation and known as a
co-inventor of the microprocessor, was a PhD student of Widrow. One
advantage the delta rule had over the original perceptron learning
algorithm was its adaptivity: If the difference between the actual output
and the correct solution was large, the connecting weights also changed in
larger steps - the smaller the steps, the closer the target was.
Disadvantage: misapplication led to infinitesimally small steps close to
the target. In the following stagnation and out of fear of scientific
unpopularity of the neural networks ADALINE was renamed to adaptive linear
element - which was undone again later on.


1961: Karl Steinbuch introduced technical realizations of associative
memory, which can be seen as predecessors of today’s neural associative
memories [Ste61]. Additionally, he described concepts for neural
techniques and analyzed their possibilities and limits.

1965: In his book Learning Machines, Nils Nilsson gave an overview of the
progress and works of this period of neural network research. It was
assumed that the basic principles of self-learning and therefore,
generally speaking, "intelligent" systems had already been discovered.
Today this assumption seems to be an exorbitant overestimation, but at
that time it provided for high popularity and sufficient research funds.

1969: Marvin Minsky and Seymour Papert published a precise mathematical
analysis of the perceptron [MP69] to show that the perceptron model was
not capable of representing many important problems (keywords: XOR problem
and linear separability), and so put an end to overestimation, popularity
and research funds. The implication that more powerful models would show
exactly the same problems and the forecast that the entire field would be
a research dead end resulted in a nearly complete decline in research
funds for the next 15 years - no matter how incorrect these forecasts were
from today’s point of view.
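
The keywords of the 1969 entry can be made tangible with a small
experiment. The following Python sketch tries many weight and threshold
combinations for a single threshold unit with two binary inputs and checks
which Boolean functions it can reproduce. The grid of candidate values and
all names are assumptions chosen for illustration, and a search over a
finite grid is of course an illustration, not a proof.

```python
from itertools import product
from typing import Sequence

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def threshold_unit(w1: float, w2: float, threshold: float,
                   x1: int, x2: int) -> int:
    """Single threshold switch: fires if the weighted sum reaches the threshold."""
    return 1 if w1 * x1 + w2 * x2 >= threshold else 0


def representable(targets: Sequence[int]) -> bool:
    """Search a grid of weights and thresholds for one that yields the targets."""
    grid = [i / 2 for i in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
    for w1, w2, threshold in product(grid, repeat=3):
        outputs = [threshold_unit(w1, w2, threshold, x1, x2)
                   for x1, x2 in INPUTS]
        if outputs == list(targets):
            return True
    return False


and_found = representable([0, 0, 0, 1])
or_found = representable([0, 1, 1, 1])
xor_found = representable([0, 1, 1, 0])
print(and_found, or_found, xor_found)   # True True False
```

AND and OR are linearly separable, so suitable values turn up quickly. For
XOR the search comes back empty, and no finer grid would help: the four
conditions on the weights and the threshold contradict each other.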


1.2.3 Long silence and slow 
reconstruction 


The research funds were, as previously mentioned, extremely short.
Everywhere research went on, but there were neither conferences nor other
events and therefore only few publications. This isolation of individual
researchers provided for many independently developed neural network
paradigms: They researched, but there was no discourse among them.

In spite of the poor appreciation the field received, the basic theories
for the still continuing renaissance were laid at that time:


1972: Teuvo Kohonen introduced a model of the linear associator, a model
of an associative memory [Koh72]. In the same year, such a model was
presented independently and from a neurophysiologist’s point of view by
James A. Anderson [And72].

1973: Christoph von der Malsburg used a neuron model that was non-linear
and biologically more motivated [vdM73].

1974: For his dissertation at Harvard Paul Werbos developed a learning
procedure called backpropagation of error [Wer74], but it was not until
one decade later that this procedure reached today’s importance.

1976-1980 and thereafter: Stephen Grossberg presented many papers (for
instance [Gro76]) in which numerous neural models are analyzed
mathematically. Furthermore, he dedicated himself to the problem of
keeping a neural network capable of learning without destroying already
learned associations. Under cooperation of Gail Carpenter this led to
models of adaptive resonance theory (ART).


1982: Teuvo Kohonen described the self-organizing feature maps (SOM)
[Koh82, Koh98], also known as Kohonen maps. He was looking for the
mechanisms involving self-organization in the brain (he knew that the
information about the creation of a being is stored in the genome, which
has, however, not enough memory for a structure like the brain. As a
consequence, the brain has to organize and create itself for the most
part).

John Hopfield also invented the so-called Hopfield networks [Hop82] which
are inspired by the laws of magnetism in physics. They were not widely
used in technical applications, but the field of neural networks slowly
regained importance.

1983: Fukushima, Miyake and Ito introduced the neural model of the
Neocognitron which could recognize handwritten characters [FMI83] and was
an extension of the Cognitron network already developed in 1975.


1.2.4 Renaissance 


Through the influence of John Hopfield, 
who had personally convinced many re¬ 
searchers of the importance of the field, 
and the wide publication of backpropagation
by Rumelhart, Hinton and
Williams, the field of neural networks 
slowly showed signs of upswing. 

1985: John Hopfield published an arti¬ 
cle describing a way of finding accept¬ 
able solutions for the Travelling Sales¬ 
man problem by using Hopfield nets. 


1986: The backpropagation of error learning
procedure as a generalization of
the delta rule was separately developed
and widely published by the Parallel
Distributed Processing Group
[RHW86a]: Non-linearly-separable
problems could be solved by multilayer
perceptrons, and Marvin Minsky’s
negative evaluations were disproven
at a single blow. At the same
time a certain kind of fatigue spread
in the field of artificial intelligence,
caused by a series of failures and unfulfilled
hopes.

From this time on, the development of 
the field of research has almost been 
explosive. It can no longer be item¬ 
ized, but some of its results will be 
seen in the following. 


Exercises 


Exercise 1. Give one example for each 
of the following topics: 

> A book on neural networks or neuroin¬ 
formatics, 

> A collaborative group of a university 
working with neural networks, 

> A software tool realizing neural net¬ 
works ("simulator"), 

> A company using neural networks, 
and 

> A product or service being realized by 
means of neural networks. 

Exercise 2. Show at least four applica¬ 
tions of technical neural networks: two 
from the field of pattern recognition and 
two from the field of function approxima¬ 
tion. 

Exercise 3. Briefly characterize the four 
development phases of neural networks 
and give expressive examples for each 
phase.


Chapter 2

Biological neural networks 


How do biological systems solve problems? How does a system of neurons 
work? How can we understand its functionality? What are different quantities 
of neurons able to do? Where in the nervous system does information 
processing occur? A short biological overview of the complexity of simple 
elements of neural information processing followed by some thoughts about 
their simplification in order to technically adapt them. 


Before we begin to describe the technical
side of neural networks, it would be useful
to briefly discuss the biology of neural
networks and the cognition of living
organisms - the reader may skip the following
chapter without missing any technical
information. On the other hand, I
recommend reading this excursus if
you want to learn something about the
underlying neurophysiology and see that
our small approaches, the technical neural
networks, are only caricatures of nature
- and how powerful their natural counterparts
must be when our small approaches
are already that effective. Now we want
to take a brief look at the nervous system
of vertebrates: We will start with a very
rough granularity and then proceed with
the brain and up to the neural level. For
further reading I want to recommend the
books [CR00, KSJ00], which helped me a
lot during this chapter.


2.1 The vertebrate nervous 
system 


The entire information processing system,
i.e. the vertebrate nervous system, consists
of the central nervous system and the
peripheral nervous system, which is only
a first and simple subdivision. In reality,
such a rigid subdivision does not make
sense, but here it is helpful to outline the
information processing in a body.


2.1.1 Peripheral and central 
nervous system 

The peripheral nervous system (PNS)
comprises the nerves that are situated outside
of the brain or the spinal cord. These
nerves form a branched and very dense network
throughout the whole body. The peripheral
nervous system includes, for example,
the spinal nerves which pass out
of the spinal cord (two within the level of
each vertebra of the spine) and supply extremities,
neck and trunk, but also the cranial
nerves directly leading to the brain.


The central nervous system (CNS),
however, is the "main-frame" within the
vertebrate. It is the place where information
received by the sense organs is
stored and managed. Furthermore, it controls
the inner processes in the body and,
last but not least, coordinates the motor
functions of the organism. The vertebrate
central nervous system consists of
the brain and the spinal cord (Fig. 2.1).
However, we want to focus on the brain,
which can - for the purpose of simplification
- be divided into four areas (Fig. 2.2
on the next page) to be discussed here.


2.1.2 The cerebrum is responsible 
for abstract thinking 
processes. 

The cerebrum (telencephalon) is one of
the areas of the brain that changed most
during evolution. Along an axis, running
from the lateral face to the back of the
head, this area is divided into two hemispheres,
which are organized in a folded
structure. These cerebral hemispheres
are connected by one strong nerve cord
("bar") and several small ones. A large
number of neurons are located in the cerebral
cortex (cortex), which is approx. 2-4 mm
thick and divided into different cortical
fields, each having a specific task to
fulfill. Primary cortical fields are responsible
for processing qualitative information,
such as the management of different
perceptions (e.g. the visual cortex
is responsible for the management of vision).
Association cortical fields, however,
perform more abstract association
and thinking processes; they also contain
our memory.

Figure 2.1: Illustration of the central nervous
system with spinal cord and brain.

Figure 2.2: Illustration of the brain. The colored
areas of the brain are discussed in the text.
The more we turn from abstract information processing
to direct reflexive processing, the darker
the areas of the brain are colored.


2.1.3 The cerebellum controls and 
coordinates motor functions 

The cerebellum is located below the cerebrum,
therefore it is closer to the spinal
cord. Accordingly, it serves less abstract
functions with higher priority: Here, large
parts of motor coordination are performed,
i.e., balance and movements are controlled
and errors are continually corrected. For
this purpose, the cerebellum has direct
sensory information about muscle lengths
as well as acoustic and visual information.
Furthermore, it also receives messages
about more abstract motor signals
coming from the cerebrum.

In the human brain the cerebellum is considerably
smaller than the cerebrum, but
this is rather an exception. In many vertebrates
this ratio is less pronounced. If
we take a look at vertebrate evolution, we
will notice that the cerebellum is not "too
small" but the cerebrum is "too large" (at
least, it is the most highly developed structure
in the vertebrate brain). The two remaining
brain areas should also be briefly
discussed: the diencephalon and the brainstem.


2.1.4 The diencephalon controls 
fundamental physiological 
processes 

The interbrain (diencephalon) includes
parts of which only the thalamus will
be briefly discussed: This part of the diencephalon
mediates between sensory and
motor signals and the cerebrum. Particularly,
the thalamus decides which part of
the information is transferred to the cerebrum,
so that especially less important
sensory perceptions can be suppressed at
short notice to avoid overloads. Another
part of the diencephalon is the hypothalamus,
which controls a number of processes
within the body. The diencephalon
is also heavily involved in the human circadian
rhythm ("internal clock") and the
sensation of pain.


D. Kriesel - A Brief Introduction to Neural Networks (ZETA2-EN)


2.1.5 The brainstem connects the 
brain with the spinal cord and 
controls reflexes. 

In comparison with the diencephalon the
brainstem (truncus cerebri) is
phylogenetically much older.
Roughly speaking, it is the "extended
spinal cord" and thus the connection between
brain and spinal cord. The brainstem
can also be divided into different areas,
some of which will be introduced
as examples in this chapter. The functions
will be discussed from abstract functions
towards more fundamental ones. One important
component is the pons (=bridge),
a kind of transit station for many nerve signals
from brain to body and vice versa.

If the pons is damaged (e.g. by a cerebral
infarct), then the result could be the
locked-in syndrome - a condition in
which a patient is "walled-in" within his
own body. He is conscious and aware
with no loss of cognitive function, but cannot
move or communicate by ordinary means.
His senses of sight, hearing, smell and
taste generally continue to work normally.
Locked-in patients may often be able
to communicate with others by blinking or
moving their eyes.

Furthermore, the brainstem is responsible 
for many fundamental reflexes, such as the 
blinking reflex or coughing. 


All parts of the nervous system have one 
thing in common: information processing. 
This is accomplished by huge accumula¬ 
tions of billions of very similar cells, whose 
structure is very simple but which com¬ 
municate continuously. Large groups of 
these cells send coordinated signals and 
thus reach the enormous information pro¬ 
cessing capacity we are familiar with from 
our brain. We will now leave the level of 
brain areas and continue with the cellular 
level of the body - the level of neurons. 


2.2 Neurons are information 
processing cells 


Before specifying the functions and pro¬ 
cesses within a neuron, we will give a 
rough description of neuron functions: A 
neuron is nothing more than a switch with 
information input and output. The switch 
will be activated if there are enough stim¬ 
uli of other neurons hitting the informa¬ 
tion input. Then, at the information out¬ 
put, a pulse is sent to, for example, other 
neurons. 


2.2.1 Components of a neuron 


Now we want to take a look at the components
of a neuron (Fig. 2.3 on the facing
page). In doing so, we will follow the
way the electrical information takes within
the neuron. The dendrites of a neuron
receive the information by special connections,
the synapses.


Figure 2.3: Illustration of a biological neuron with the components discussed in this text.


2.2.1.1 Synapses weight the individual 
parts of information 

Incoming signals from other neurons or 
cells are transferred to a neuron by special 
connections, the synapses. Such connec¬ 
tions can usually be found at the dendrites 
of a neuron, sometimes also directly at the 
soma. We distinguish between electrical 
and chemical synapses. 

The electrical synapse is the simpler
variant. An electrical signal received by
the synapse, i.e. coming from the presynaptic
side, is directly transferred to the
postsynaptic cell. Thus,
there is a direct, strong, unadjustable
connection between the signal transmitter
and the signal receiver, which is, for example,
relevant to shortening reactions that
must be "hard coded" within a living organism.


The chemical synapse is the more distinctive
variant. Here, the electrical coupling
of source and target does not take
place; the coupling is interrupted by the
synaptic cleft. This cleft electrically separates
the presynaptic side from the postsynaptic
one. You might think that, nevertheless,
the information has to flow, so we
will discuss how this happens: It is not an
electrical, but a chemical process. On the
presynaptic side of the synaptic cleft the
electrical signal is converted into a chemical
signal, a process induced by chemical
cues released there (the so-called neurotransmitters).
These neurotransmitters
cross the synaptic cleft and transfer the
information to the postsynaptic cell
(this is a very simple explanation, but later
on we will see how this exactly works),
where it is reconverted into electrical information.
The neurotransmitters are degraded
very fast, so that it is possible to release
very precise information pulses here,
too.

In spite of its more complex functioning,
the chemical synapse has - compared
with the electrical synapse - considerable
advantages:

One-way connection: A chemical
synapse is a one-way connection.
Due to the fact that there is no direct
electrical connection between the
pre- and postsynaptic area, electrical
pulses in the postsynaptic area
cannot flash over to the presynaptic
area.

Adjustability: There is a large number of
different neurotransmitters that can
also be released in various quantities
in a synaptic cleft. There are neurotransmitters
that stimulate the postsynaptic
cell, and others that
slow down such stimulation. Some
synapses transfer a strongly stimulating
signal, some only weakly stimulating
ones. The adjustability varies
a lot, and one of the central points
in the examination of the learning
ability of the brain is that here the
synapses are variable, too. That is,
over time they can form a stronger or
weaker connection.

2.2.1.2 Dendrites collect all parts of 
information 

Dendrites branch like trees from the cell
body of the neuron (which is called
soma) and receive electrical signals from
many different sources, which are then
transferred into the cell body.
The entirety of the branching dendrites is also
called the dendrite tree.


2.2.1.3 In the soma the weighted 
information is accumulated 

After the cell body (soma) has received
plenty of activating (=stimulating)
and inhibiting (=diminishing) signals
by synapses or dendrites, the soma accumulates
these signals. As soon as the accumulated
signal exceeds a certain value
(called threshold value), the soma
of the neuron activates an electrical pulse
which then is transmitted to the neurons
connected to the current one.
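
The accumulation described above can be sketched in a few lines of Python. This is only a caricature under stated assumptions: the function name, the weights and the threshold values are hypothetical and chosen for illustration, and real neurons accumulate signals over time rather than in a single sum.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True if the weighted sum of the inputs exceeds the threshold."""
    # Each synapse scales ("weights") its incoming signal; negative weights
    # stand for inhibiting synapses, positive weights for stimulating ones.
    accumulated = sum(x * w for x, w in zip(inputs, weights))
    return accumulated > threshold


# Hypothetical example: three incoming signals, the third synapse is inhibiting.
signals = [1.0, 1.0, 1.0]
weights = [0.6, 0.5, -0.4]   # the accumulated signal is about 0.7

print(neuron_fires(signals, weights, threshold=0.5))
print(neuron_fires(signals, weights, threshold=1.0))
```

With the lower threshold the accumulated signal is sufficient and the neuron fires; with the higher one it stays silent.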


2.2.1.4 The axon transfers outgoing 
pulses 

The pulse is transferred to other neurons 
by means of the axon. The axon is a 
long, slender extension of the soma. In 
an extreme case, an axon can stretch up 
to one meter (e.g. within the spinal cord). 
The axon is electrically insulated in order
to achieve a better conduction of the electrical
signal (we will return to this point
later on) and it leads to dendrites, which
transfer the information to, for example,
other neurons. So now we are back at the
beginning of our description of the neuron
elements. An axon can, however, also transfer
information to other kinds of cells in order
to control them.


2.2.2 Electrochemical processes in 
the neuron and its 
components 

After having pursued the path of an elec¬ 
trical signal from the dendrites via the 
synapses to the nucleus of the cell and 
from there via the axon into other den¬ 
drites, we now want to take a small step 
from biology towards technology. In doing 
so, a simplified introduction of the electro¬ 
chemical information processing should be 
provided. 

2.2.2.1 Neurons maintain electrical 
membrane potential 

One fundamental aspect is the fact that 
compared to their environment the neu¬ 
rons show a difference in electrical charge, 
a potential. In the membrane (=envelope)
of the neuron the charge is different
from the charge on the outside. This dif¬ 
ference in charge is a central concept that 
is important to understand the processes 
within the neuron. The difference is called 
membrane potential. The membrane 
potential, i.e., the difference in charge, is 
created by several kinds of charged atoms 
(ions), whose concentration varies within 
and outside of the neuron. If we penetrate 
the membrane from the inside outwards, 
we will find certain kinds of ions more of¬ 
ten or less often than on the inside. This 
descent or ascent of concentration is called 
a concentration gradient. 

Let us first take a look at the membrane
potential in the resting state of the neuron,
i.e., we assume that no electrical signals
are received from the outside. In this
case, the membrane potential is -70 mV.
Since we have learned that this potential
depends on the concentration gradients of
various ions, there is of course the central
question of how to maintain these concentration
gradients: Normally, diffusion predominates
and therefore each ion is eager
to decrease concentration gradients and
to spread out evenly. If this happens,
the membrane potential will move towards
0 mV, so finally there would be no membrane
potential anymore. Thus, the neuron
actively maintains its membrane potential
to be able to process information.
How does this work?

The secret is the membrane itself, which is
permeable to some ions, but not to others.
To maintain the potential, various mechanisms
are at work at the same time:

Concentration gradient: As described
above the ions try to be as uniformly
distributed as possible. If the
concentration of an ion is higher on
the inside of the neuron than on
the outside, it will try to diffuse
to the outside and vice versa.
The positively charged ion K+
(potassium) occurs very frequently
within the neuron but less frequently
outside of the neuron, and therefore
it slowly diffuses out through the
neuron’s membrane. But another
group of negative ions, collectively
called A-, remains within the neuron
since the membrane is not permeable
to them. Thus, the inside of the
neuron becomes negatively charged.
Negative A- ions remain, positive K+
ions disappear, and so the inside of
the cell becomes more negative. The
result is another gradient.

Electrical gradient: The electrical gradient
acts contrary to the concentration
gradient. The intracellular charge is
now strongly negative, therefore it attracts
positive ions: K+ wants to get back
into the cell.

If these two gradients were now left alone,
they would eventually balance out, reach
a steady state, and a membrane potential
of -85 mV would develop. But we
want to achieve a resting membrane potential
of -70 mV, thus there seem to exist
some disturbances which prevent this.
Furthermore, there is another important
ion, Na+ (sodium), for which the membrane
is not very permeable but which,
however, slowly pours through the membrane
into the cell. The sodium
is driven into the cell for two reasons: On the
one hand, there is less sodium within the
neuron than outside the neuron. On the
other hand, sodium is positively charged
but the interior of the cell has negative
charge, which is a second reason for the
sodium wanting to get into the cell.

Due to the low diffusion of sodium into the
cell the intracellular sodium concentration
increases. But at the same time the inside
of the cell becomes less negative, so that
K+ pours in more slowly (we can see that
this is a complex mechanism where everything
is influenced by everything). The
sodium shifts the intracellular equilibrium
from negative to less negative, compared
with its environment. But even with these
two ions a standstill with all gradients being
balanced out could still be achieved.
Now the last piece of the puzzle comes into
play: a "pump" (or rather, the protein
Na+/K+-ATPase, which is driven by ATP)
actively transports ions against the
direction they actually want to take!

Sodium is actively pumped out of the cell, 
although it tries to get into the cell 
along the concentration gradient and 
the electrical gradient. 

Potassium, however, diffuses strongly out 
of the cell, but is actively pumped 
back into it. 

For this reason the pump is also called
sodium-potassium pump. The pump
maintains the concentration gradient for
the sodium as well as for the potassium,
so that some sort of steady state equilibrium
is created and finally the resting potential
is -70 mV as observed. All in all
the membrane potential is maintained by
the fact that the membrane is impermeable
to some ions and other ions are actively
pumped against the concentration
and electrical gradients. Now that we
know that each neuron has a membrane
potential we want to observe how a neuron
receives and transmits signals.
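
To get a feeling for the magnitudes involved, the balance of gradients can be estimated numerically. The following is a minimal sketch, assuming the standard Nernst equation for a single ion and a simplified Goldman equation restricted to potassium and sodium; neither equation is introduced in this text. The ion concentrations, the permeability ratio and the temperature are assumed textbook-style values, not values taken from this chapter.

```python
import math

R = 8.314462618      # gas constant, J/(mol*K)
F = 96485.33212      # Faraday constant, C/mol
T = 310.15           # assumed body temperature (37 degrees C), in kelvin


def nernst_mv(conc_out, conc_in, charge=1):
    """Equilibrium potential (mV) of a single ion species."""
    return 1000.0 * R * T / (charge * F) * math.log(conc_out / conc_in)


def goldman_mv(k_out, k_in, na_out, na_in, p_k=1.0, p_na=0.04):
    """Resting potential (mV) from K+ and Na+ only (simplified Goldman equation).

    p_k and p_na are relative membrane permeabilities (assumed values).
    """
    numerator = p_k * k_out + p_na * na_out
    denominator = p_k * k_in + p_na * na_in
    return 1000.0 * R * T / F * math.log(numerator / denominator)


# Assumed concentrations in mmol/l (illustrative, not from this text).
K_OUT, K_IN = 5.0, 140.0
NA_OUT, NA_IN = 145.0, 15.0

print(f"potassium alone:      {nernst_mv(K_OUT, K_IN):.1f} mV")
print(f"potassium and sodium: {goldman_mv(K_OUT, K_IN, NA_OUT, NA_IN):.1f} mV")
```

With these assumed numbers, potassium alone yields a potential in the neighborhood of the -85 mV mentioned above, and the small sodium permeability shifts it towards the observed resting potential of about -70 mV. The exact figures depend on the chosen concentrations.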

2.2.2.2 The neuron is activated by 
changes in the membrane 
potential 

Above we have learned that sodium and
potassium can diffuse through the membrane
- sodium slowly, potassium faster.
They move through channels within the
membrane, the sodium and potassium
channels. In addition to these permanently
open channels responsible for
diffusion and balanced by the sodium-potassium
pump, there also exist channels
that are not always open but which only
respond "if required". Since the opening
of these channels changes the concentration
of ions within and outside of the membrane,
it also changes the membrane potential.

These controllable channels are opened as 
soon as the accumulated received stimulus 
exceeds a certain threshold. For example, 
stimuli can be received from other neurons 
or have other causes. There exist, for ex¬ 
ample, specialized forms of neurons, the 
sensory cells, for which a light incidence 
could be such a stimulus. If the incom¬ 
ing amount of light exceeds the threshold, 
controllable channels are opened. 

The said threshold (the threshold potential)
lies at about -55 mV. As soon as the
received stimuli reach this value, the neuron
is activated and an electrical signal,
an action potential, is initiated. Then
this signal is transmitted to the cells connected
to the observed neuron, i.e. the
cells "listen" to the neuron. Now we want
to take a closer look at the different stages
of the action potential (Fig. 2.4 on the next
page):

Resting state: Only the permanently
open sodium and potassium channels
are permeable. The membrane
potential is at -70 mV and actively
kept there by the neuron.


Stimulus up to the threshold: A stimulus
opens channels so that sodium
can pour in. The intracellular charge
becomes more positive. As soon as
the membrane potential exceeds the
threshold of -55 mV, the action potential
is initiated by the opening of
many sodium channels.

Depolarization: Sodium is pouring in. Re¬ 
member: Sodium wants to pour into 
the cell because there is a lower in¬ 
tracellular than extracellular concen¬ 
tration of sodium. Additionally, the 
cell is dominated by a negative en¬ 
vironment which attracts the posi¬ 
tive sodium ions. This massive in¬ 
flux of sodium drastically increases 
the membrane potential - up to ap¬ 
prox. +30 mV - which is the electrical 
pulse, i.e., the action potential. 

Repolarization: Now the sodium channels 
are closed and the potassium channels 
are opened. The positively charged 
ions want to leave the positive inte¬ 
rior of the cell. Additionally, the intra¬ 
cellular concentration is much higher 
than the extracellular one, which in¬ 
creases the efflux of ions even more. 
The interior of the cell is once again 
more negatively charged than the ex¬ 
terior. 

Hyperpolarization: Sodium as well as
potassium channels are closed again.
At first the membrane potential is
slightly more negative than the resting
potential. This is due to the
fact that the potassium channels close
more slowly. As a result, (positively
charged) potassium effuses because of
its lower extracellular concentration.
After a refractory period of 1-2
ms the resting state is re-established
so that the neuron can react to newly
applied stimuli with an action potential.
In simple terms, the refractory
period is a mandatory break a neuron
has to take in order to regenerate.
The shorter this break is, the more
often a neuron can fire per unit of time.

Then the resulting pulse is transmitted by
the axon.
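
The stages listed above can be caricatured by a so-called leaky integrate-and-fire model. This model is a common simplification and is not described in this text: it ignores the individual ion channels and only keeps the resting potential, the threshold and the refractory period. The leak rate and the stimulus strengths are assumed values chosen for illustration.

```python
REST_MV = -70.0        # resting potential from the text
THRESHOLD_MV = -55.0   # threshold potential from the text
REFRACTORY_MS = 2.0    # upper end of the 1-2 ms refractory period
LEAK_PER_MS = 0.1      # assumed: fraction of the offset from rest lost per ms


def simulate(stimulus_mv_per_ms, duration_ms, dt_ms=0.1):
    """Return the spike times (in ms) for a constant stimulus."""
    potential = REST_MV
    refractory_left = 0.0
    spikes = []
    steps = int(round(duration_ms / dt_ms))
    for step in range(steps):
        t = step * dt_ms
        if refractory_left > 0.0:
            # Mandatory break: the neuron ignores stimuli and stays at rest.
            refractory_left -= dt_ms
            potential = REST_MV
            continue
        # The leak pulls the potential back to rest, the stimulus pushes it up.
        potential += dt_ms * (stimulus_mv_per_ms
                              - LEAK_PER_MS * (potential - REST_MV))
        if potential >= THRESHOLD_MV:
            spikes.append(round(t, 1))   # an action potential is initiated
            potential = REST_MV          # reset instead of modelling the pulse
            refractory_left = REFRACTORY_MS
    return spikes


for stimulus in (1.0, 3.0, 6.0):
    print(stimulus, len(simulate(stimulus, duration_ms=100.0)))
```

A weak stimulus never reaches the threshold, so no action potential is initiated. Stronger stimuli reach it sooner and the neuron fires more often, but never faster than the refractory period permits.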


2.2.2.3 In the axon a pulse is
conducted in a saltatory way


We have already learned that the axon
is used to transmit the action potential
across long distances (remember: You will
find an illustration of a neuron including
an axon in Fig. 2.3 on page 17). The axon
is a long, slender extension of the soma.
In vertebrates it is normally coated by a
myelin sheath that consists of Schwann
cells (in the PNS) or oligodendrocytes
(in the CNS)¹, which insulate the axon
very well from electrical activity. At a distance
of 0.1-2 mm there are gaps between
these cells, the so-called nodes of Ranvier.
The said gaps appear where one insulating
cell ends and the next one begins.
It is obvious that at such a node the axon
is less insulated.


1 Schwann cells as well as oligodendrocytes are vari¬ 
eties of the glial cells. There are about 50 times 
more glial cells than neurons: They surround the 
neurons (glia = glue), insulate them from each 
other, provide energy, etc. 


Now you may assume that these less insulated
nodes are a disadvantage of the
axon - however, they are not. At the
nodes, mass can be transferred between
the intracellular and extracellular area, a
transfer that is impossible at those parts
of the axon which are situated between
two nodes (internodes) and therefore insulated
by the myelin sheath. This mass
transfer permits the generation of signals
similar to the generation of the action potential
within the soma. The action potential
is transferred as follows: It does
not continuously travel along the axon but
jumps from node to node. Thus, a series
of depolarizations travels along the nodes of
Ranvier. One action potential initiates the
next one, and mostly even several nodes
are active at the same time here. The
pulse "jumping" from node to node is responsible
for the name of this pulse conductor:
saltatory conductor.

Obviously, the pulse will move faster if its
jumps are larger. Axons with large internodes
(2 mm) achieve a signal propagation
speed of approx. 180 meters per second.
However, the internodes cannot grow indefinitely,
since the action potential to be
transferred would fade too much until it
reaches the next node. So the nodes have
a task, too: to constantly amplify the signal.
The cells receiving the action potential
are attached to the end of the axon -
often connected by dendrites and synapses.
As already indicated above, the action potentials
are not only generated by information
received by the dendrites from other
neurons.
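
A quick calculation puts the quoted figures into perspective. It is only a sketch: it assumes an axon of one meter (the extreme case mentioned earlier) with uniform internodes of 2 mm, and the function name is hypothetical.

```python
def conduction_summary(axon_length_m, internode_mm=2.0, speed_m_per_s=180.0):
    """Return (number of jumps, travel time in ms, time per jump in microseconds)."""
    jumps = axon_length_m / (internode_mm / 1000.0)
    total_ms = axon_length_m / speed_m_per_s * 1000.0
    per_jump_us = total_ms * 1000.0 / jumps
    return jumps, total_ms, per_jump_us


jumps, total_ms, per_jump_us = conduction_summary(1.0)
print(f"{jumps:.0f} jumps, {total_ms:.2f} ms in total, "
      f"{per_jump_us:.1f} microseconds per jump")
```

Under these assumptions the pulse needs about 500 jumps and roughly 5.6 milliseconds for the whole meter, i.e. about 11 microseconds per jump.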
2.3 Receptor cells are
modified neurons 


Action potentials can also be generated by 
sensory information an organism receives 
from its environment through its sensory 
cells. Specialized receptor cells are able 
to perceive specific stimulus energies such 
as light, temperature and sound or the existence
of certain molecules (like, for example,
the sense of smell). This works
because these sensory cells
are actually modified neurons. They do
not receive electrical signals via dendrites 
but the existence of the stimulus being 
specific for the receptor cell ensures that 
the ion channels open and an action po¬ 
tential is developed. This process of trans¬ 
forming stimulus energy into changes in 
the membrane potential is called sensory 
transduction. Usually, the stimulus en¬ 
ergy itself is too weak to directly cause 
nerve signals. Therefore, the signals are 
amplified either during transduction or by 
means of the stimulus-conducting ap¬ 
paratus. The resulting action potential 
can be processed by other neurons and is 
then transmitted into the thalamus, which 
is, as we have already learned, a gateway 
to the cerebral cortex and therefore can re¬ 
ject sensory impressions according to cur¬ 
rent relevance and thus prevent an abun¬ 
dance of information to be managed. 


2.3.1 There are different receptor 
cells for various types of 
perceptions 

Primary receptors transmit their pulses 
directly to the nervous system. A good 
example for this is the sense of pain. 
Here, the stimulus intensity is propor¬ 
tional to the amplitude of the action po¬ 
tential. Technically, this is an amplitude 
modulation. 

Secondary receptors, however, continu¬ 
ously transmit pulses. These pulses con¬ 
trol the amount of the related neurotrans¬ 
mitter, which is responsible for transfer¬ 
ring the stimulus. The stimulus in turn 
controls the frequency of the action potential
of the receiving neuron. This process
is a frequency modulation, an encoding of
the stimulus, which makes it easier to perceive
the increase and decrease of a stimulus.
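
Frequency modulation can be illustrated with a small sketch. The encoding below is a deliberately simple assumption: a normalized stimulus intensity between 0 and 1 is mapped linearly onto a pulse rate, with an assumed maximum of 500 pulses per second (which would correspond to a refractory period of 2 ms). All function names are hypothetical.

```python
def spike_times(intensity, duration_ms, max_rate_hz=500.0):
    """Encode a stimulus intensity in [0, 1] as regularly spaced spike times (ms)."""
    intensity = min(max(intensity, 0.0), 1.0)
    rate_hz = intensity * max_rate_hz      # stronger stimulus -> higher frequency
    if rate_hz == 0.0:
        return []
    interval_ms = 1000.0 / rate_hz
    count = int(duration_ms / interval_ms)
    return [round((i + 1) * interval_ms, 3) for i in range(count)]


def decode_intensity(spikes, duration_ms, max_rate_hz=500.0):
    """Recover the intensity from the spike count alone; all pulses look the same."""
    return len(spikes) / (duration_ms / 1000.0) / max_rate_hz


weak = spike_times(0.2, duration_ms=100.0)
strong = spike_times(0.8, duration_ms=100.0)
print(len(weak), len(strong))
print(decode_intensity(weak, 100.0), decode_intensity(strong, 100.0))
```

The receiving side only has to count pulses per time to estimate the stimulus, so a change in the stimulus shows up directly as a change in the pulse rate.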

There can be individual receptor cells or
cells forming complex sensory organs (e.g.
eyes or ears). They can receive stimuli
within the body (by means of the interoceptors)
as well as stimuli outside of the
body (by means of the exteroceptors).

After having outlined how information is 
received from the environment, it will be 
interesting to look at how the information 
is processed.


2.3.2 Information is processed on 
every level of the nervous 
system 

There is no reason to believe that all re¬ 
ceived information is transmitted to the 
brain and processed there, and that the 
brain ensures that it is "output" in the 
form of motor pulses (the only thing an 
organism can actually do within its envi¬ 
ronment is to move). The information pro¬ 
cessing is entirely decentralized. In order 
to illustrate this principle, we want to take 
a look at some examples, which leads us 
again from the abstract to the fundamen¬ 
tal in our hierarchy of information process¬ 
ing. 

> It is certain that information is pro¬ 
cessed in the cerebrum, which is the 
most developed natural information 
processing structure. 

> The midbrain and the thalamus, 
which serves - as we have already 
learned - as a gateway to the cere¬ 
bral cortex, are situated much lower 
in the hierarchy. The filtering of in¬ 
formation with respect to the current 
relevance executed by the midbrain 
is a very important method of infor¬ 
mation processing, too. But even the 
thalamus does not receive any prepro¬ 
cessed stimuli from the outside. Now 
let us continue with the lowest level, 
the sensory cells. 

> On the lowest level, i.e. at the receptor cells, the information is
  not only received and transferred but directly processed. One of
  the main aspects of this subject is to prevent the transmission of
  "continuous stimuli" to the central nervous system because of
  sensory adaptation: Due to continuous stimulation many receptor
  cells automatically become insensitive to stimuli. Thus, receptor
  cells are not a direct mapping of specific stimulus energy onto
  action potentials but depend on the past. Other sensors change
  their sensitivity according to the situation: There are taste
  receptors which respond more or less to the same stimulus according
  to the nutritional condition of the organism.

> Even before a stimulus reaches the receptor cells, information
  processing can already be executed by a preceding signal carrying
  apparatus, for example in the form of amplification: The external
  and the internal ear have a specific shape to amplify the sound,
  which also allows - in association with the sensory cells of the
  sense of hearing - the sensory stimulus to increase only
  logarithmically with the intensity of the heard signal. On closer
  examination, this is necessary, since the sound pressure of the
  signals for which the ear is constructed can vary over a wide
  exponential range. Here, a logarithmic measurement is an advantage.
  Firstly, an overload is prevented and secondly, the fact that the
  intensity measurement of intensive signals will be less precise
  does not matter either. If a jet fighter is starting next to you,
  small changes in the noise level can be ignored.
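
The advantage of a logarithmic measurement can be illustrated
numerically. The following is only a sketch and not a model of the
ear: it assumes the conventional decibel definition of the sound
pressure level, 20 * log10(p / p0), with the usual reference pressure
p0 = 20 micropascal. The four example pressures are arbitrary
illustrative values, each 100 times larger than the previous one.

```python
import math

P0 = 20e-6  # assumed reference sound pressure in pascal


def sound_pressure_level(p):
    """Return the sound pressure level in decibels for a pressure p in Pa."""
    return 20.0 * math.log10(p / P0)


# Illustrative pressures spanning six orders of magnitude.
pressures = [20e-6, 20e-4, 20e-2, 20.0]
levels = [sound_pressure_level(p) for p in pressures]

for p, level in zip(pressures, levels):
    print(f"{p:10.6f} Pa -> {level:6.1f} dB")
```

Every hundredfold increase of the pressure adds the same amount to
the logarithmic value, so a very wide range of intensities is mapped
onto a small range of measured values.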

Just to get a feeling for sensory organs and information processing
in the organism, we will briefly describe "usual" light sensing
organs, i.e. organs often found in nature. For the third light
sensing organ described below, the single lens eye, we will discuss
the information processing in the eye.


2.3.3 An outline of common light 
sensing organs 

For many organisms it turned out to be extremely useful to be able to
perceive electromagnetic radiation in certain regions of the
spectrum. Consequently, sensory organs have developed which can
detect such electromagnetic radiation. The wavelength range of the
radiation perceivable by the human eye is called visible range or
simply light. The different wavelengths of this electromagnetic
radiation are perceived by the human eye as different colors. The
visible range of the electromagnetic radiation is different for each
organism. Some organisms cannot see the colors (= wavelength ranges)
we can see, others can even perceive additional wavelength ranges
(e.g. in the UV range). Before we begin with the human being - in
order to get a broader knowledge of the sense of sight - we briefly
want to look at two organs of sight which, from an evolutionary point
of view, have existed much longer than the human one.


2.3.3.1 Compound eyes and pinhole
eyes only provide high temporal
or spatial resolution


Let us first take a look at the so-called compound eye (fig. 2.5),
which is, for example, common in insects and crustaceans. The
compound eye consists of a great number of small, individual eyes. If
we look at the compound eye from the outside, the individual eyes are
clearly visible and arranged in a hexagonal pattern. Each individual
eye has its own nerve fiber which is connected to the insect brain.
Since the individual eyes can be distinguished, it is obvious that
the number of pixels, i.e. the spatial resolution, of compound eyes
must be very low and the image is blurred. But compound eyes have
advantages, too, especially for fast-flying insects. Certain compound
eyes process more than 300 images per second (to the human eye,
however, movies with 25 images per second appear as a fluid motion).


Pinhole eyes are, for example, found in 
octopus species and work - as you can 
guess - similar to a pinhole camera. A 
pinhole eye has a very small opening for 
light entry, which projects a sharp image 
onto the sensory cells behind. Thus, the 
spatial resolution is much higher than in 
the compound eye. But due to the very 
small opening for light entry the resulting 
image is less bright. 


Compound eye: high temporal, low spatial resolution.

Pinhole camera: high spatial, low temporal resolution.

Single lens eye: high temporal and spatial resolution.



Figure 2.5: Compound eye of a robber fly 


2.3.3.2 Single lens eyes combine the 
advantages of the other two 
eye types, but they are more 
complex 

The light sensing organ common in vertebrates is the single lens eye.
The resulting image is a sharp, high-resolution image of the
environment at high or variable light intensity. On the other hand it
is more complex. Similar to the pinhole eye the light enters through
an opening (pupil) and is projected onto a layer of sensory cells in
the eye (retina). But in contrast to the pinhole eye, the size of the
pupil can be adapted to the lighting conditions (by means of the iris
muscle, which expands or contracts the pupil). These differences in
pupil dilation make it necessary to actively focus the image.
Therefore, the single lens eye contains an additional adjustable
lens.


2.3.3.3 The retina does not only
receive information but is also
responsible for information
processing

The light signals falling on the eye are received by the retina and
directly preprocessed by several layers of information-processing
cells. We want to briefly discuss the different steps of this
information processing and in doing so, we follow the way of the
information carried by the light:

Photoreceptors receive the light signal and cause action potentials
   (there are different receptors for different color components and
   light intensities). These receptors are the real light-receiving
   part of the retina and they are so sensitive that a single photon
   falling on the retina can cause an action potential. Then several
   photoreceptors transmit their signals to one single

bipolar cell. This means that here the information has already been
   summarized. Finally, the now transformed light signal travels from
   several bipolar cells² into

ganglion cells. Various bipolar cells can transmit their information
   to one ganglion cell. The higher the number of photoreceptors that
   affect the ganglion cell, the larger the field of perception, the
   receptive field, which the ganglion cell covers - and the less
   sharp is the image in the area of this ganglion cell. So the
   information is already reduced directly in the retina and the
   overall image is, for example, blurred in the peripheral field of
   vision. So far, we have learned about the information processing
   in the retina only as a top-down structure. Now we want to take a
   look at the

horizontal and amacrine cells. These cells are not connected from the
   front backwards but laterally. They allow the light signals to
   influence each other laterally, directly during the information
   processing in the retina - a much more powerful method of
   information processing than compressing and blurring. When the
   horizontal cells are excited by a photoreceptor, they are able to
   excite other nearby photoreceptors and at the same time inhibit
   more distant bipolar cells and receptors. This ensures the clear
   perception of outlines and bright points. Amacrine cells can
   further intensify certain stimuli by distributing information from
   bipolar cells to several ganglion cells or by inhibiting ganglion
   cells.

2 There are different kinds of bipolar cells, as well,
but to discuss all of them would go too far.

These first steps of transmitting visual information to the brain
show that information is, on the one hand, processed from the first
moment it is received and, on the other hand, processed in parallel
within millions of information-processing cells. The system's power
and resistance to errors is based upon this massive division of
work.
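
The contrast-enhancing effect of lateral connections can be sketched
with a one-dimensional toy model. This is a strongly simplified
illustration and not a model of real retinal circuitry: it assumes
that the response of each cell is its own input minus a fraction of
the inputs of its two direct neighbours. The function name and the
inhibition strength of 0.25 are hypothetical choices.

```python
def lateral_inhibition(signal, strength=0.25):
    """Subtract a fraction of the neighbouring inputs from every cell.

    Border cells reuse their own value for the missing neighbour.
    """
    out = []
    n = len(signal)
    for k in range(n):
        left = signal[k - 1] if k > 0 else signal[k]
        right = signal[k + 1] if k < n - 1 else signal[k]
        out.append(signal[k] - strength * (left + right))
    return out


# A step from dark (1.0) to bright (3.0), i.e. an outline in the image.
stimulus = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
response = lateral_inhibition(stimulus)
print(response)
```

In the uniform regions the response is flat, while directly at the
step the dark side is pushed down and the bright side is pushed up.
The outline is therefore emphasized relative to its surroundings.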


2.4 The amount of neurons in 
living organisms at 
different stages of 
development 

An overview of different organisms and their neural capacity (in
large part from [RD05]):

302 neurons are required by the nervous system of a nematode worm,
   which serves as a popular model organism in biology. Nematodes
   live in the soil and feed on bacteria.

$10^4$ neurons make an ant (to simplify matters we neglect the fact
   that some ant species can also have more or less efficient nervous
   systems). Due to the use of different attractants and odors, ants
   are able to engage in complex social behavior and form huge states
   with millions of individuals. If you regard such an ant state as
   an individual, it has a cognitive capacity similar to a chimpanzee
   or even a human.

With $10^5$ neurons the nervous system of a fly can be constructed. A
   fly can evade an object in real-time in three-dimensional space,
   it can land upon the ceiling upside down, has a considerable
   sensory system because of compound eyes, vibrissae, nerves at the
   end of its legs and much more. Thus, a fly has considerable
   differential and integral calculus in high dimensions implemented
   "in hardware". We all know that a fly is not easy to catch. Of
   course, the bodily functions are also controlled by neurons, but
   these should be ignored here.

With $0.8 \cdot 10^6$ neurons we have enough cerebral matter to
   create a honeybee. Honeybees build colonies and have amazing
   capabilities in the field of aerial reconnaissance and navigation.

$4 \cdot 10^6$ neurons result in a mouse, and here the world of
   vertebrates already begins.

$1.5 \cdot 10^7$ neurons are sufficient for a rat, an animal which is
   reputed to be extremely intelligent and is often used to
   participate in a variety of intelligence tests representative for
   the animal world. Rats have an extraordinary sense of smell and
   orientation, and they also show social behavior. The brain of a
   frog can be positioned within the same order of magnitude. The
   frog has a complex build with many functions, it can swim and has
   evolved complex behavior. A frog can continuously target the said
   fly by means of its eyes while jumping in three-dimensional space
   and catch it with its tongue with considerable probability.

$5 \cdot 10^7$ neurons make a bat. The bat can navigate in total
   darkness through a room, accurate to within several centimeters,
   by only using its sense of hearing. It uses acoustic signals to
   localize self-camouflaging insects (e.g. some moths have a certain
   wing structure that reflects fewer sound waves, so the echo will
   be small) and also eats its prey while flying.


$1.6 \cdot 10^8$ neurons are required by the brain of a dog,
   companion of man for ages. Now take a look at another popular
   companion of man:

$3 \cdot 10^8$ neurons can be found in a cat, which is about twice as
   many as in a dog. We know that cats are very elegant, patient
   carnivores that can show a variety of behaviors. By the way, an
   octopus can be positioned within the same magnitude. Only very few
   people know that, for example, in labyrinth orientation the
   octopus is vastly superior to the rat.

For $6 \cdot 10^9$ neurons you already get a chimpanzee, one of the
   animals being very similar to the human.

$10^{11}$ neurons make a human. Usually, the human has considerable
   cognitive capabilities, is able to speak, to abstract, to remember
   and to use tools as well as the knowledge of other humans to
   develop advanced technologies and manifold social structures.

With $2 \cdot 10^{11}$ neurons there are nervous systems having more
   neurons than the human nervous system. Here we should mention
   elephants and certain whale species.

Our state-of-the-art computers are not able to keep up with the
aforementioned processing power of a fly. Recent research results
suggest that the processes in nervous systems might be vastly more
powerful than people thought until not long ago: Michaeva et al.
describe a separate, synapse-integrated way of information processing
[MBW+10]. Posterity will show if they are right.


2.5 Transition to technical 
neurons: neural networks 
are a caricature of biology 


How do we change from biological neural networks to the technical
ones? Through radical simplification. I want to briefly summarize the
conclusions relevant for the technical part:

We have learned that the biological neurons are linked to each other
in a weighted way and when stimulated they electrically transmit
their signal via the axon. From the axon the signal is not directly
transferred to the succeeding neurons, but first has to cross the
synaptic cleft where it is changed again by variable chemical
processes. In the receiving neuron the various inputs that have been
post-processed in the synaptic cleft are summarized or accumulated to
one single pulse. Depending on how the neuron is stimulated by the
cumulated input, the neuron itself emits a pulse or not - thus, the
output is non-linear and not proportional to the cumulated input. Our
brief summary corresponds exactly with the few elements of biological
neural networks we want to take over into the technical
approximation:


Vectorial input: The input of technical neurons consists of many
   components, therefore it is a vector. In nature a neuron receives
   pulses of $10^3$ to $10^4$ other neurons on average.

Scalar output: The output of a neuron is a scalar, which means that
   the output only consists of one component. Several scalar outputs
   in turn form the vectorial input of another neuron. This
   particularly means that somewhere in the neuron the various input
   components have to be summarized in such a way that only one
   component remains.

Synapses change input: In technical neural networks the inputs are
   preprocessed, too. They are multiplied by a number (the weight) -
   they are weighted. The set of such weights represents the
   information storage of a neural network - in both biological
   original and technical adaptation.

Accumulating the inputs: In biology, the inputs are summarized to a
   pulse according to the chemical change, i.e., they are
   accumulated - on the technical side this is often realized by the
   weighted sum, which we will get to know later on. This means that
   after accumulation we continue with only one value, a scalar,
   instead of a vector.

Non-linear characteristic: The output of our technical neurons is
   also not proportional to the input.


Adjustable weights: The weights weighting the inputs are variable,
   similar to the chemical processes at the synaptic cleft. This adds
   a great dynamic to the network because a large part of the
   "knowledge" of a neural network is saved in the weights and in the
   form and power of the chemical processes in a synaptic cleft.

So our current, only casually formulated and very simple neuron model
receives a vectorial input $\vec{x}$ with components $x_i$. These are
multiplied by the appropriate weights $w_i$ and accumulated:

$$\sum_i w_i x_i.$$

The aforementioned term is called weighted sum. Then the nonlinear
mapping $f$ defines the scalar output $y$:

$$y = f\left(\sum_i w_i x_i\right).$$
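
The simple neuron model can be written down in a few lines. The
following is a minimal sketch under assumptions: the text has not yet
fixed a concrete nonlinear mapping f, so the hyperbolic tangent is
used here purely as a placeholder, and the input and weight values
are arbitrary examples.

```python
import math


def weighted_sum(x, w):
    """Accumulate the inputs: sum over i of w_i * x_i."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def neuron_output(x, w, f=math.tanh):
    """Scalar output y = f(weighted sum); f is an assumed placeholder."""
    return f(weighted_sum(x, w))


x = [1.0, 0.5, -1.0]   # vectorial input (example values)
w = [0.2, -0.4, 0.1]   # one weight per input component (example values)

s = weighted_sum(x, w)
y = neuron_output(x, w)
print(s, y)
```

The vector x is reduced to the single scalar s before the nonlinear
mapping is applied, which is exactly the step from vectorial input to
scalar output described above.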



After this transition we now want to specify our neuron model more
precisely and add some odds and ends. Afterwards we will take a look
at how the weights can be adjusted.

Exercises 

Exercise 4. It is estimated that a human brain consists of approx.
$10^{11}$ nerve cells, each of which has about $10^3$ to $10^4$
synapses. For this exercise we assume $10^3$ synapses per neuron. Let
us further assume that a single synapse could store 4 bits of
information. Naively calculated: How much storage capacity does the
brain have? Note: The information which neuron is connected to which
other neuron is also important.
Chapter 3

Components of artificial neural networks 


Formal definitions and colloquial explanations of the components that realize 
the technical adaptations of biological neural networks. Initial descriptions of 
how to combine these components into a neural network. 


This chapter contains the formal definitions for most of the neural
network components used later in the text. After this chapter you
will be able to read the individual chapters of this work without
having to know the preceding ones (although this would be useful).


3.1 The concept of time in 
neural networks 


In some definitions of this text we use the term time or the number
of cycles of the neural network, respectively. Time is divided into
discrete time steps:

Definition 3.1 (The concept of time). The current time (present time)
is referred to as $(t)$, the next time step as $(t + 1)$, the
preceding one as $(t - 1)$. All other time steps are referred to
analogously. If in the following chapters several mathematical
variables (e.g. $net_j$ or $o_i$) refer to a certain point in time,
the notation will be, for example, $net_j(t - 1)$ or $o_i(t)$.

From a biological point of view this is, of course, not very
plausible (in the human brain a neuron does not wait for another
one), but it significantly simplifies the implementation.


3.2 Components of neural 
networks 


A technical neural network consists of simple processing units, the
neurons, and directed, weighted connections between those neurons.
Here, the strength of a connection (or the connecting weight) between
two neurons $i$ and $j$ is referred to as $w_{i,j}$¹.

Definition 3.2 (Neural network). A neural network is a sorted triple
$(N, V, w)$ with two sets $N$, $V$ and a function $w$, where $N$ is
the set of neurons and $V$ a set $\{(i, j) \mid i, j \in N\}$ whose
elements are called connections between neuron $i$ and neuron $j$.
The function $w : V \to \mathbb{R}$ defines the weights, where
$w((i, j))$, the weight of the connection between neuron $i$ and
neuron $j$, is shortened to $w_{i,j}$. Depending on the point of view
it is either undefined or 0 for connections that do not exist in the
network.
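
The triple (N, V, w) can be mirrored almost literally in code. The
following sketch is an illustration only and is unrelated to Snipe:
the neuron identifiers, the example connections and the choice to
return 0 for non-existing connections are assumptions made for this
example.

```python
# N: the set of neurons, identified by arbitrary integers.
N = {1, 2, 3}

# V: the set of connections (i, j) with i, j in N.
V = {(1, 2), (1, 3), (2, 3)}

# Example weights, one real number per connection in V.
weights = {(1, 2): 0.5, (1, 3): -1.0, (2, 3): 2.0}


def w(connection):
    """Weight function w: V -> R; here 0.0 for connections not in V."""
    return weights.get(connection, 0.0)


# Consistency checks: connections link neurons of N, weights cover V.
connections_valid = all(i in N and j in N for (i, j) in V)
weights_cover_v = set(weights) == V
print(connections_valid, weights_cover_v, w((1, 2)), w((3, 1)))
```

Here the second of the two points of view from the definition is
taken: a connection that does not exist simply has the weight 0.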

SNIPE: In Snipe, an instance of the class NeuralNetworkDescriptor is
created in the first place. The descriptor object roughly outlines a
class of neural networks, e.g. it defines the number of neuron layers
in a neural network. In a second step, the descriptor object is used
to instantiate an arbitrary number of NeuralNetwork objects. To get
started with Snipe programming, the documentations of exactly these
two classes are - in that order - the right thing to read. The
presented layout involving descriptor and dependent neural networks
is very reasonable from the implementation point of view, because it
makes it possible to create and maintain general parameters of even
very large sets of similar (but not necessarily equal) networks.

1 Note: In some of the cited literature $i$ and $j$ could
be interchanged in $w_{i,j}$. Here, a consistent standard
does not exist. But in this text I try to use
the notation I found more frequently and in the
more significant citations.

So the weights can be implemented in a square weight matrix $W$ or,
optionally, in a weight vector $W$, with the row number of the matrix
indicating where the connection begins, and the column number of the
matrix indicating which neuron is the target. Indeed, in this case
the numeric 0 marks a non-existing connection. This matrix
representation is also called Hinton diagram².
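
As a small illustration, a weight matrix of this kind can be built
from a list of weighted connections. This is a sketch with assumed
example data; the function name weight_matrix is hypothetical. Rows
stand for the neuron where a connection begins, columns for the
target neuron, and 0.0 marks a non-existing connection.

```python
neurons = [1, 2, 3]
weights = {(1, 2): 0.5, (1, 3): -1.0, (2, 3): 2.0}


def weight_matrix(neurons, weights):
    """Row: source neuron, column: target neuron, 0.0: no connection."""
    return [[weights.get((i, j), 0.0) for j in neurons] for i in neurons]


W = weight_matrix(neurons, weights)
for row in W:
    print(row)
```

In this example all non-zero entries lie above the diagonal, because
every connection leads from a neuron with a smaller number to one
with a larger number.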

The neurons and connections comprise the following components and
variables (I'm following the path of the data within a neuron, which
is according to fig. 3.1 in top-down direction):


3.2.1 Connections carry information
that is processed by neurons

Data are transferred between neurons via connections with the
connecting weight being either excitatory or inhibitory. The
definition of connections has already been included in the definition
of the neural network.

SNIPE: Connection weights can be set using the method
NeuralNetwork.setSynapse.


3.2.2 The propagation function
converts vector inputs to
scalar network inputs

Looking at a neuron $j$, we will usually find a lot of neurons with a
connection to $j$, i.e. which transfer their output to $j$.

2 Note that, here again, in some of the cited literature
axes and rows could be interchanged. The
published literature is not consistent here, as well.

[Figure 3.1, surviving panel label: Activation function (transforms
net input and sometimes old activation to new activation)]



For a neuron $j$ the propagation function receives the outputs
$o_{i_1}, \ldots, o_{i_n}$ of other neurons $i_1, i_2, \ldots, i_n$
(which are connected to $j$), and transforms them in consideration of
the connecting weights $w_{i,j}$ into the network input $net_j$ that
can be further processed by the activation function. Thus, the
network input is the result of the propagation function.

Definition 3.3 (Propagation function and network input). Let
$I = \{i_1, i_2, \ldots, i_n\}$ be the set of neurons, such that
$\forall z \in \{1, \ldots, n\} : \exists w_{i_z,j}$. Then the
network input of $j$, called $net_j$, is calculated by the
propagation function $f_{prop}$ as follows:

$$net_j = f_{prop}(o_{i_1}, \ldots, o_{i_n}, w_{i_1,j}, \ldots, w_{i_n,j}) \tag{3.1}$$

Here the weighted sum is very popular: The multiplication of the
output of each neuron $i$ by $w_{i,j}$, and the summation of the
results:

$$net_j = \sum_{i \in I} (o_i \cdot w_{i,j}) \tag{3.2}$$
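
The weighted sum of equation (3.2) can be sketched as follows. This
is an illustration with assumed example data and is not the Snipe
implementation; the function name net_input and the dictionary-based
representation of outputs and weights are hypothetical choices.

```python
def net_input(j, outputs, weights):
    """Weighted sum: net_j = sum over i in I of o_i * w_{i,j}.

    I is the set of neurons i for which a weight w_{i,j} exists.
    """
    return sum(o_i * weights[(i, j)]
               for i, o_i in outputs.items()
               if (i, j) in weights)


# Outputs o_i of neurons 1, 2, 3 and weights of their connections.
outputs = {1: 1.0, 2: 0.5, 3: 2.0}
weights = {(1, 4): 0.5, (2, 4): -1.0, (3, 5): 3.0}

net_4 = net_input(4, outputs, weights)
net_5 = net_input(5, outputs, weights)
print(net_4, net_5)
```

Only neurons that actually have a connection to j contribute to
net_j, so neuron 3 does not influence neuron 4 in this example.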


Figure 3.1: Data processing of a neuron. The 
activation function of a neuron implies the 
threshold value. 


SNIPE: The propagation function in Snipe was implemented using the
weighted sum.


3.2.3 The activation is the
"switching status" of a
neuron

Based on the model of nature every neuron is, to a certain extent, at
all times active, excited or whatever you will call it. The reactions
of the neurons to the input values depend on this activation state.
The activation state indicates the extent of a neuron's activation
and is often shortly referred to as activation. Its formal definition
is included in the following definition of the activation function.
But generally, it can be defined as follows:

Definition 3.4 (Activation state / activation in general). Let $j$ be
a neuron. The activation state $a_j$, in short activation, is
explicitly assigned to $j$, indicates the extent of the neuron's
activity and results from the activation function.

SNIPE: It is possible to get and set activation states of neurons by
using the methods getActivation or setActivation in the class
NeuralNetwork.


3.2.4 Neurons get activated if the
network input exceeds their
threshold value

Near the threshold value, the activation function of a neuron reacts
particularly sensitively. From the biological point of view the
threshold value represents the threshold at which a neuron starts
firing. The threshold value is also mostly included in the definition
of the activation function, but generally the definition is the
following:

Definition 3.5 (Threshold value in general). Let $j$ be a neuron. The
threshold value $\Theta_j$ is uniquely assigned to $j$ and marks the
position of the maximum gradient value of the activation function.


3.2.5 The activation function
determines the activation of a
neuron dependent on network
input and threshold value

At a certain time - as we have already learned - the activation $a_j$
of a neuron $j$ depends on the previous³ activation state of the
neuron and the external input.

Definition 3.6 (Activation function and Activation). Let $j$ be a
neuron. The activation function is defined as

$$a_j(t) = f_{act}(net_j(t), a_j(t - 1), \Theta_j). \tag{3.3}$$

It transforms the network input $net_j$, as well as the previous
activation state $a_j(t - 1)$ into a new activation state $a_j(t)$,
with the threshold value $\Theta$ playing an important role, as
already mentioned.

Unlike the other variables within the neural network (particularly
unlike the ones defined so far) the activation function is often
defined globally for all neurons or at least for a set of neurons and
only the threshold values are different for each neuron. We should
also keep in mind that the threshold values can be changed, for
example by a learning procedure. So it can in particular become
necessary to relate the threshold value to the time and to write, for
instance, $\Theta_j$ as $\Theta_j(t)$ (but for reasons of clarity, I
omitted this here). The activation function is also called transfer
function.

3 The previous activation is not always relevant for
the current - we will see examples for both variants.
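
The role of the threshold value as the position of the maximum
gradient can be checked numerically. The following sketch makes
several assumptions: it uses a logistic function shifted by the
threshold as activation function, it ignores the previous activation
state, and the threshold 1.5 as well as all function names are
hypothetical example choices.

```python
import math


def fermi_with_threshold(net, theta):
    """Logistic activation shifted so that it is centered at theta."""
    return 1.0 / (1.0 + math.exp(-(net - theta)))


def slope(f, x, h=1e-5):
    """Central difference approximation of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


theta = 1.5  # assumed example threshold
grid = [k / 10.0 for k in range(-50, 51)]  # net inputs from -5.0 to 5.0

# Find the grid point at which the activation function is steepest.
steepest = max(
    grid,
    key=lambda x: slope(lambda net: fermi_with_threshold(net, theta), x),
)
print(steepest)
```

On this grid the steepest point coincides with the chosen threshold,
which is where the neuron reacts most sensitively to changes of the
network input.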


SNIPE: In Snipe, activation functions are generalized to neuron
behaviors. Such behaviors can represent just normal activation
functions, or even incorporate internal states and dynamics.
Corresponding parts of Snipe can be found in the package
neuronbehavior, which also contains some of the activation functions
introduced in the next section. The interface NeuronBehavior allows
for implementation of custom behaviors. Objects that implement this
interface can be passed to a NeuralNetworkDescriptor instance. It is
possible to define individual behaviors per neuron layer.


3.2.6 Common activation functions 


The simplest activation function is the binary threshold function
(fig. 3.2), which can only take on two values (also referred to as
Heaviside function). If the input is above a certain threshold, the
function changes from one value to another, but otherwise remains
constant. This implies that the function is not differentiable at the
threshold and for the rest the derivative is 0. Due to this fact,
backpropagation learning, for example, is impossible (as we will see
later). Also very popular is the Fermi function or logistic function
(fig. 3.2)

$$\frac{1}{1 + e^{-x}}, \tag{3.4}$$

which maps to the range of values of $(0, 1)$ and the hyperbolic
tangent (fig. 3.2) which maps to $(-1, 1)$. Both functions are
differentiable. The Fermi function can be expanded by a temperature
parameter $T$ into the form

$$\frac{1}{1 + e^{-x/T}}. \tag{3.5}$$

The smaller this parameter, the more does it compress the function on
the x axis. Thus, one can arbitrarily approximate the Heaviside
function. Incidentally, there exist activation functions which are
not explicitly defined but depend on the input according to a random
distribution (stochastic activation function).
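
The activation functions of this section can be sketched as follows.
This is an illustration and not the Snipe implementation: the two
output values 0 and 1 of the binary threshold function, its behavior
exactly at the threshold and the example temperatures are assumptions
of this sketch.

```python
import math


def heaviside(x, threshold=0.0):
    """Binary threshold function; the values 0.0 and 1.0 are assumed."""
    return 1.0 if x > threshold else 0.0


def fermi(x, T=1.0):
    """Fermi (logistic) function with temperature parameter T > 0.

    Note: for very small T and strongly negative x, math.exp overflows.
    """
    return 1.0 / (1.0 + math.exp(-x / T))


def tanh(x):
    """Hyperbolic tangent, maps to the open interval (-1, 1)."""
    return math.tanh(x)


# The smaller T, the closer the Fermi function gets to the step.
for T in (1.0, 0.1, 0.01):
    print(T, fermi(-0.5, T), fermi(0.5, T))
print(heaviside(-0.5), heaviside(0.5), tanh(0.5))
```

With decreasing temperature the values on both sides of 0 move
towards 0 and 1 respectively, which is the approximation of the
Heaviside function mentioned above.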


An alternative to the hyperbolic tangent that is really worth
mentioning was suggested by Anguita et al. [APZ93], who were tired of
the slowness of the workstations back in 1993. Thinking about how to
make neural network propagations faster, they quickly identified the
approximation of the e-function used in the hyperbolic tangent as one
of the causes of slowness. Consequently, they "engineered" an
approximation to the hyperbolic tangent, just using two parabola
pieces and two half-lines. At the price of delivering a slightly
smaller range of values than the hyperbolic tangent
($[-0.96016; 0.96016]$ instead of $[-1; 1]$), it can, depending on
the CPU one uses, be calculated 200 times faster because it just
needs two multiplications and one addition. What's more, it has some
other advantages that will be mentioned later.


SNIPE: The activation functions introduced here are implemented
within the classes Fermi and TangensHyperbolicus, both of which are
located in the package neuronbehavior. The fast hyperbolic tangent
approximation is located within the class
TangensHyperbolicusAnguita.


3.2.7 An output function may be
used to process the activation
once again

The output function of a neuron $j$ calculates the values which are
transferred to the other neurons connected to $j$. More formally:

Definition 3.7 (Output function). Let $j$ be a neuron. The output
function

$$f_{out}(a_j) = o_j \tag{3.6}$$

calculates the output value $o_j$ of the neuron $j$ from its
activation state $a_j$.

Generally, the output function is defined globally, too. Often this
function is the identity, i.e. the activation $a_j$ is directly
output⁴:

$$f_{out}(a_j) = a_j, \text{ so } o_j = a_j \tag{3.7}$$

Unless explicitly specified differently, we will use the identity as
output function within this text.
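
The three functions introduced so far can be chained into the
complete data path of a single neuron. The following sketch rests on
assumptions: the weighted sum is used as propagation function, a
hyperbolic tangent shifted by the threshold serves as an example
activation function that ignores the previous activation, and the
output function is the identity. All function names are hypothetical.

```python
import math


def f_prop(outputs, weights):
    """Propagation function: weighted sum of the incoming outputs."""
    return sum(o * w for o, w in zip(outputs, weights))


def f_act(net, theta):
    """Example activation function (assumed): tanh centered at theta."""
    return math.tanh(net - theta)


def f_out(a):
    """Output function: the identity, so o_j = a_j."""
    return a


def neuron(outputs, weights, theta=0.0):
    """Data path of one neuron: propagation, activation, output."""
    net = f_prop(outputs, weights)
    a = f_act(net, theta)
    return f_out(a)


o_j = neuron([1.0, -1.0], [0.5, 0.5])
print(o_j)
```

Because the output function is the identity, the value passed on to
the following neurons is simply the activation of the neuron.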


Figure 3.2: Various popular activation functions, from top to bottom:
Heaviside or binary threshold function, Fermi function, hyperbolic
tangent. The Fermi function was expanded by a temperature parameter.
The original Fermi function is represented by dark colors, the
temperature parameters of the modified Fermi functions are, ordered
ascending by steepness, ...

4 Other definitions of output functions may be useful
if the range of values of the activation function
is not sufficient.


3.2.8 Learning strategies adjust a
network to fit our needs

Since we will address this subject later in detail and at first want
to get to know the principles of neural network structures, I will
only provide a brief and general definition here:


Definition 3.8 (General learning rule). The learning strategy is an
algorithm that can be used to change and thereby train the neural
network, so that the network produces a desired output for a given
input.


3.3 Network topologies 


After we have become acquainted with the composition of the elements
of a neural network, I want to give an overview of the usual
topologies (= designs) of neural networks, i.e. to construct networks
consisting of these elements. Every topology described in this text
is illustrated by a map and its Hinton diagram so that the reader can
immediately see the characteristics and apply them to other networks.

In the Hinton diagram the dotted weights are represented by light
grey fields, the solid ones by dark grey fields. The input and output
arrows, which were added for reasons of clarity, cannot be found in
the Hinton diagram. In order to clarify that the connections are
between the line neurons and the column neurons, I have inserted a
small arrow in the upper-left cell.

SNIPE: Snipe is designed for realization of arbitrary network
topologies. In this respect, Snipe defines different kinds of
synapses depending on their source and their target. Any kind of
synapse can separately be allowed or forbidden for a set of networks
using the setAllowed methods in a NeuralNetworkDescriptor instance.


3.3.1 Feedforward networks consist 
of layers and connections 
towards each following layer 


In this text feedforward networks
(fig. 3.3 on the following page) are
the networks we will first explore (even if
we will use different topologies later). The
neurons are grouped in the following
layers: one input layer, n hidden
processing layers (invisible from the
outside, which is why their neurons are also
referred to as hidden neurons) and one
output layer. In a feedforward network each
neuron in one layer has only directed
connections to the neurons of the next layer
(towards the output layer). In fig. 3.3 on
the next page the connections permitted
for a feedforward network are represented
by solid lines. We will often be confronted
with feedforward networks in which every
neuron i is connected to all neurons of the
next layer (these layers are called
completely linked). To prevent naming
conflicts the output neurons are often
referred to as Ω.


Definition 3.9 (Feedforward network).
The neuron layers of a feedforward
network (fig. 3.3 on the following page) are
clearly separated: One input layer, one
output layer and one or more processing
layers which are invisible from the outside
(also called hidden layers). Connections
are only permitted to neurons of the
following layer.


3.3.1.1 Shortcut connections skip layers


Some feedforward networks permit the
so-called shortcut connections (fig. 3.4 on
the next page): connections that skip one
or more levels. These connections may
only be directed towards the output layer,
too.


Definition 3.10 (Feedforward network
with shortcut connections). Similar to the
feedforward network, but the connections
may not only be directed towards the next
layer but also towards any other
subsequent layer.


3.3.2 Recurrent networks have 
influence on themselves 





Recurrence is defined as the process of a 
neuron influencing itself by any means or 
by any connection. Recurrent networks do 
not always have explicitly defined input or 
output neurons. Therefore in the figures 
I omitted all markings that concern this 
matter and only numbered the neurons. 


Figure 3.3: A feedforward network with three 
layers: two input neurons, three hidden neurons 
and two output neurons. Characteristic for the 
Hinton diagram of completely linked feedforward 
networks is the formation of blocks above the 
diagonal. 


3.3.2.1 Direct recurrences start and 
end at the same neuron 

Some networks allow for neurons to be
connected to themselves, which is called
direct recurrence or sometimes
self-recurrence (fig. 3.5 on the facing page).
As a result, neurons inhibit and therefore
strengthen themselves in order to reach
their activation limits.


Figure 3.5: A network similar to a feedforward
network with directly recurrent neurons. The di¬ 
rect recurrences are represented by solid lines and 
exactly correspond to the diagonal in the Hinton 
diagram matrix. 


Figure 3.4: A feedforward network with short¬ 
cut connections, which are represented by solid 
lines. On the right side of the feedforward blocks 
new connections have been added to the Hinton 
diagram. 


Definition 3.11 (Direct recurrence).
Now we expand the feedforward network
by connecting a neuron j to itself, with the
weights of these connections being referred
to as $w_{j,j}$. In other words: the diagonal
of the weight matrix W may be different
from 0.


3.3.2.2 Indirect recurrences can
influence their starting neuron
only by making detours


If connections are allowed towards the in¬ 
put layer, they will be called indirect re¬ 
currences. Then a neuron j can use in¬ 
direct forwards connections to influence it¬ 
self, for example, by influencing the neu¬ 
rons of the next layer and the neurons of 
this next layer influencing j (fig. 3.6). 


Definition 3.12 (Indirect recurrence).
Again our network is based on a
feedforward network, now with additional
connections between neurons and their
preceding layer being allowed. Therefore,
the entries of W below the diagonal may be
different from 0.


3.3.2.3 Lateral recurrences connect 
neurons within one layer 


Connections between neurons within one
layer are called lateral recurrences
(fig. 3.7 on the facing page). Here, each
neuron often inhibits the other neurons of
the layer and strengthens itself. As a
result only the strongest neuron becomes
active (winner-takes-all scheme).


Definition 3.13 (Lateral recurrence). A 
laterally recurrent network permits con¬ 
nections within one layer. 




Figure 3.6: A network similar to a feedforward 
network with indirectly recurrent neurons. The 
indirect recurrences are represented by solid lines. 
As we can see, connections to the preceding lay¬ 
ers can exist here, too. The fields that are sym¬ 
metric to the feedforward blocks in the Hinton 
diagram are now occupied. 


3.3.3 Completely linked networks 
allow any possible connection 

Completely linked networks permit
connections between all neurons, except for
direct recurrences. Furthermore, the
connections must be symmetric (fig. 3.8 on
the next page). A popular example are the
self-organizing maps, which will be
introduced in chapter 10.



Definition 3.14 (Complete
interconnection). In this case, every neuron
is always allowed to be connected to every
other neuron - but as a result every neuron
can become an input neuron. Therefore,
direct recurrences normally cannot be
applied here and clearly defined layers no
longer exist. Thus, the matrix W may be
unequal to 0 everywhere, except along its
diagonal.
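
The topology definitions above can be read directly off the weight matrix. The following is a minimal sketch, not taken from the text or from Snipe: it assumes that W[i, j] != 0 means a connection from neuron i (row) to neuron j (column), as in the Hinton diagram, and the helper `classify_connections` and the list `layer_of` are hypothetical names introduced only for illustration.

```python
import numpy as np

def classify_connections(W, layer_of):
    """Sort every non-zero entry of W into one of the connection types.

    Assumption: W[i, j] != 0 means "neuron i feeds neuron j";
    layer_of[i] is the layer index of neuron i (0 = input layer).
    """
    kinds = {"feedforward": [], "shortcut": [], "direct": [],
             "indirect": [], "lateral": []}
    n = W.shape[0]
    for i in range(n):
        for j in range(n):
            if W[i, j] == 0:
                continue
            step = layer_of[j] - layer_of[i]
            if i == j:
                kinds["direct"].append((i, j))       # diagonal of W
            elif step == 1:
                kinds["feedforward"].append((i, j))  # next layer
            elif step > 1:
                kinds["shortcut"].append((i, j))     # skips a layer
            elif step == 0:
                kinds["lateral"].append((i, j))      # same layer
            else:
                kinds["indirect"].append((i, j))     # back to earlier layer
    return kinds

# Layout of fig. 3.3: two input, three hidden, two output neurons.
layer_of = [0, 0, 1, 1, 1, 2, 2]
W = np.zeros((7, 7))
W[0:2, 2:5] = 1.0   # input -> hidden block (above the diagonal)
W[2:5, 5:7] = 1.0   # hidden -> output block (above the diagonal)
W[0, 5] = 1.0       # one shortcut connection
W[3, 3] = 1.0       # one direct recurrence
W[5, 2] = 1.0       # one indirect recurrence (below the diagonal)
W[2, 3] = 1.0       # one lateral recurrence

kinds = classify_connections(W, layer_of)
```

With neurons numbered from input to output, a pure feedforward network only has entries above the diagonal, which is the block structure mentioned in the caption of fig. 3.3.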



Figure 3.7: A network similar to a feedforward
network with laterally recurrent neurons. The
lateral recurrences are represented by solid lines.
Here, recurrences only exist within the layer.
In the Hinton diagram, filled squares are
concentrated around the diagonal in the height of
the feedforward blocks, but the diagonal is left
uncovered.


3.4 The bias neuron is a 

technical trick to consider 
threshold values as 
connection weights 

By now we know that in many network 
paradigms neurons have a threshold value 
that indicates when a neuron becomes ac¬ 
tive. Thus, the threshold value is an 
activation function parameter of a neu¬ 
ron. From the biological point of view 
this sounds most plausible, but it is com¬ 
plicated to access the activation function 
at runtime in order to train the threshold 
value. 


But threshold values $\Theta_{j_1}, \ldots, \Theta_{j_n}$ for
neurons $j_1, j_2, \ldots, j_n$ can also be realized
as connecting weight of a continuously
firing neuron: For this purpose an
additional bias neuron whose output value
is always 1 is integrated in the network
and connected to the neurons $j_1, j_2, \ldots, j_n$.
These new connections get the weights
$-\Theta_{j_1}, \ldots, -\Theta_{j_n}$, i.e. they get the negative
threshold values.


Definition 3.15. A bias neuron is a
neuron whose output value is always 1 and
which is drawn as its own node in the
network diagrams. It is used to represent
neuron biases as connection weights, which
enables any weight-training algorithm to
train the biases at the same time.


Figure 3.8: A completely linked network with
symmetric connections and without direct
recurrences. In the Hinton diagram only the diagonal
is left blank.


Then the threshold value of the neurons
$j_1, j_2, \ldots, j_n$ is set to 0. Now the
threshold values are implemented as connection
weights (fig. 3.9 on page 46) and can
directly be trained together with the
connection weights, which considerably
facilitates the learning process.


In other words: Instead of including the 
threshold value in the activation function, 
it is now included in the propagation func¬ 
tion. Or even shorter: The threshold value 
is subtracted from the network input, i.e. 
it is part of the network input. More for¬ 
mally: 


Let $j_1, j_2, \ldots, j_n$ be neurons with
threshold values $\Theta_{j_1}, \ldots, \Theta_{j_n}$. By inserting a
bias neuron whose output value is always
1, generating connections between the said
bias neuron and the neurons $j_1, j_2, \ldots, j_n$
and weighting these connections
$w_{\mathrm{BIAS},j_1}, \ldots, w_{\mathrm{BIAS},j_n}$ with $-\Theta_{j_1}, \ldots, -\Theta_{j_n}$,
we can set $\Theta_{j_1} = \ldots = \Theta_{j_n} = 0$ and
receive an equivalent neural network
whose threshold values are realized by
connection weights.
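
The equivalence can be checked numerically for a single neuron. This is only a sketch under assumptions that the passage does not fix: the propagation function is taken to be the weighted sum, the activation function is the Fermi function, and all names and numbers are made up for illustration.

```python
import numpy as np

def fermi(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_with_threshold(x, w, theta):
    # Threshold kept inside the activation function: a = f(net - theta)
    net = np.dot(w, x)
    return fermi(net - theta)

def output_with_bias_neuron(x, w, theta):
    # Bias neuron with constant output 1 and weight -theta is added
    # to the inputs; the neuron's own threshold is now 0.
    x_ext = np.concatenate(([1.0], x))
    w_ext = np.concatenate(([-theta], w))
    net = np.dot(w_ext, x_ext)
    return fermi(net)

x = np.array([0.5, -1.0, 2.0])    # outputs of the preceding neurons
w = np.array([0.8, 0.1, -0.4])    # connection weights
theta = 0.3                       # threshold value

a = output_with_threshold(x, w, theta)
b = output_with_bias_neuron(x, w, theta)
```

Both variants yield the same activation, because subtracting the threshold from the network input is the same as adding the term 1 · (−Θ) to the weighted sum.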

Undoubtedly, the advantage of the bias 
neuron is the fact that it is much easier 
to implement it in the network. One dis¬ 
advantage is that the representation of the 
network already becomes quite ugly with 
only a few neurons, let alone with a great 
number of them. By the way, a bias neu¬ 
ron is often referred to as on neuron. 





Figure 3.10: Different types of neurons that will 
appear in the following text. 


From now on, the bias neuron is omit¬ 
ted for clarity in the following illustrations, 
but we know that it exists and that the 
threshold values can simply be treated as 
weights because of it. 

SNIPE: In Snipe, a bias neuron was imple¬ 
mented instead of neuron-individual biases. 
The neuron index of the bias neuron is 0. 


3.5 Representing neurons

We have already seen that we can either
write its name or its threshold value into
a neuron. Another useful representation,
which we will use several times in the
following, is to illustrate neurons
according to their type of data processing.
See fig. 3.10 for some examples without
further explanation - the different types of
neurons are explained as soon as we need
them.


3.6 Take care of the order in
which neuron activations
are calculated

For a neural network it is very important
in which order the individual neurons
receive and process the input and output the
results. Here, we distinguish two model
classes:


3.6.1 Synchronous activation


All neurons change their values
synchronously, i.e. they simultaneously
calculate network inputs, activation and
output, and pass them on. Synchronous
activation corresponds closest to its
biological counterpart, but if it is to be
implemented in hardware, it is only useful on
certain parallel computers and especially
not for feedforward networks. This order
of activation is the most generic and can
be used with networks of arbitrary
topology.




Figure 3.9: Two equivalent neural networks, one without bias neuron on the left, one with bias 
neuron on the right. The neuron threshold values can be found in the neurons, the connecting 
weights at the connections. Furthermore, I omitted the weights of the already existing connections 
(represented by dotted lines on the right side). 


Definition 3.16 (Synchronous activa¬ 
tion). All neurons of a network calculate 
network inputs at the same time by means 
of the propagation function, activation by 
means of the activation function and out¬ 
put by means of the output function. Af¬ 
ter that the activation cycle is complete. 

SNIPE: When implementing in software, 
one could model this very general activa¬ 
tion order by every time step calculating 
and caching every single network input, 
and after that calculating all activations. 
This is exactly how it is done in Snipe, be¬ 
cause Snipe has to be able to realize arbi¬ 
trary network topologies. 


3.6.2 Asynchronous activation 

Here, the neurons do not change their val¬ 
ues simultaneously but at different points 


of time. For this, there exist different or¬ 
ders, some of which I want to introduce in 
the following: 


3.6.2.1 Random order 

Definition 3.17 (Random order of
activation). With random order of
activation a neuron i is randomly chosen and
its $net_i$, $a_i$ and $o_i$ are updated. For n
neurons a cycle is the n-fold execution of this
step. Obviously, some neurons are
repeatedly updated during one cycle, and others,
however, not at all.


Apparently, this order of activation is not 
always useful. 




3.6.2.2 Random permutation 

With random permutation each neuron 
is chosen exactly once, but in random or¬ 
der, during one cycle. 

Definition 3.18 (Random permutation). 
Initially, a permutation of the neurons is 
calculated randomly and therefore defines 
the order of activation. Then the neurons 
are successively processed in this order. 
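
The difference between the two random orders is easiest to see in the update sequence of one cycle. The sketch below only generates those sequences of neuron indices; the function names are hypothetical and the neuron update itself is left out.

```python
import random

def random_order_cycle(n, rng):
    # n independent draws: a neuron may appear several times or not at all
    return [rng.randrange(n) for _ in range(n)]

def random_permutation_cycle(n, rng):
    # every neuron appears exactly once, in shuffled order
    order = list(range(n))
    rng.shuffle(order)
    return order

rng = random.Random(0)   # fixed seed only to make the sketch repeatable
n = 8
ro = random_order_cycle(n, rng)
rp = random_permutation_cycle(n, rng)
```

Sorting `rp` always gives back 0, ..., n−1, whereas `ro` carries no such guarantee.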

This order of activation is rarely used as
well because, firstly, the order is
generally useless and, secondly, it is very
time-consuming to compute a new permutation
for every cycle. A Hopfield network
(chapter 8) is a topology nominally having a
random or a randomly permuted order of
activation. But note that in practice, for
the previously mentioned reasons, a fixed
order of activation is preferred.

For all orders either the previous neuron 
activations at time t or, if already existing, 
the neuron activations at time t + 1, for 
which we are calculating the activations, 
can be taken as a starting point. 

3.6.2.3 Topological order 

Definition 3.19 (Topological activation). 
With topological order of activation 
the neurons are updated during one cycle 
and according to a fixed order. The order 
is defined by the network topology. 

This procedure can only be considered for 
non-cyclic, i.e. non-recurrent, networks, 


since otherwise there is no order of activa¬ 
tion. Thus, in feedforward networks (for 
which the procedure is very reasonable) 
the input neurons would be updated first, 
then the inner neurons and finally the out¬ 
put neurons. This may save us a lot of 
time: Given a synchronous activation or¬ 
der, a feedforward network with n layers 
of neurons would need n full propagation 
cycles in order to enable input data to 
have influence on the output of the net¬ 
work. Given the topological activation or¬ 
der, we just need one single propagation. 
However, not every network topology al¬ 
lows for finding a special activation order 
that enables saving time. 
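
The time saving can be made concrete with a small experiment. The following sketch is an assumption-laden illustration and not Snipe code: it uses the layout of fig. 3.3, tanh as activation function, all outputs initialized to 0, and external input added to the network input of the input neurons. W[i, j] is the weight from neuron i to neuron j.

```python
import numpy as np

def synchronous_cycle(W, o, x, input_idx):
    # All network inputs are computed from the OLD outputs first ...
    net = W.T @ o                  # net_j = sum_i o_i * w_ij
    net[input_idx] += x            # external input to the input neurons
    return np.tanh(net)            # ... then all activations at once

def topological_cycle(W, o, x, input_idx):
    # Neurons are updated one after another in ascending index order,
    # which is a topological order if indices ascend from input to output.
    o = o.copy()
    external = np.zeros_like(o)
    external[input_idx] = x
    for j in range(len(o)):
        o[j] = np.tanh(W[:, j] @ o + external[j])
    return o

def cycles_until_output_reacts(step, W, x, input_idx, output_idx, max_cycles=10):
    o = np.zeros(W.shape[0])
    for cycle in range(1, max_cycles + 1):
        o = step(W, o, x, input_idx)
        if np.any(o[output_idx] != 0.0):
            return cycle, o
    return None, o

W = np.zeros((7, 7))
W[0:2, 2:5] = 0.5                  # input -> hidden
W[2:5, 5:7] = 0.5                  # hidden -> output
x = np.array([1.0, -0.5])
input_idx, output_idx = [0, 1], [5, 6]

sync_cycles, sync_out = cycles_until_output_reacts(
    synchronous_cycle, W, x, input_idx, output_idx)
topo_cycles, topo_out = cycles_until_output_reacts(
    topological_cycle, W, x, input_idx, output_idx)
```

For this three-layer network the synchronous variant needs three cycles before the output neurons react to the input, while the topological variant needs one; after that both produce the same output values.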

SNIPE: Those who want to use Snipe
for implementing feedforward networks
may save some calculation time by
using the feature fastprop (mentioned
within the documentation of the class
NeuralNetworkDescriptor). Once fastprop
is enabled, it will cause the data
propagation to be carried out in a slightly
different way. In the standard mode, all net
inputs are calculated first, followed by all
activations. In the fastprop mode, for every
neuron, the activation is calculated right
after the net input. The neuron values are
calculated in ascending neuron index order.
The neuron numbers are ascending from input
to output layer, which provides us with the
perfect topological activation order for
feedforward networks.


3.6.2.4 Fixed orders of activation 
during implementation 

Obviously, fixed orders of activation
can be defined as well. Therefore, when
implementing, for instance, feedforward
networks it is very popular to determine
the order of activation once according to
the topology and to use this order without
further verification at runtime. But this is
not necessarily useful for networks that are
able to change their topology.


3.7 Communication with the 
outside world: input and 
output of data in and 
from neural networks 


Finally, let us take a look at the fact that,
of course, many types of neural networks
permit the input of data. Then these data
are processed and can produce output.
Let us, for example, regard the
feedforward network shown in fig. 3.3 on page 40:
It has two input neurons and two output
neurons, which means that it also has two
numerical inputs $x_1, x_2$ and outputs $y_1, y_2$.
As a simplification we summarize the
input and output components for n input
or output neurons within the vectors
$x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$.


Definition 3.20 (Input vector). A
network with n input neurons needs n inputs
$x_1, x_2, \ldots, x_n$. They are considered as
input vector $x = (x_1, x_2, \ldots, x_n)$. As a
consequence, the input dimension is
referred to as n. Data is put into a neural
network by using the components of the
input vector as network inputs of the input
neurons.


Definition 3.21 (Output vector). A
network with m output neurons provides m
outputs $y_1, y_2, \ldots, y_m$. They are regarded
as output vector $y = (y_1, y_2, \ldots, y_m)$.
Thus, the output dimension is referred
to as m. Data is output by a neural
network by the output neurons adopting the
components of the output vector in their
output values.


SNIPE: In order to propagate data through 
a NeuralNetwork-instance, the propagate 
method is used. It receives the input vector 
as array of doubles, and returns the output 
vector in the same way. 


Now we have defined and closely examined 
the basic components of neural networks - 
without having seen a network in action. 
But first we will continue with theoretical 
explanations and generally describe how a 
neural network could learn. 


Exercises 

Exercise 5. Would it be useful (from 
your point of view) to insert one bias neu¬ 
ron in each layer of a layer-based network, 
such as a feedforward network? Discuss 
this in relation to the representation and 
implementation of the network. Will the 
result of the network change? 

Exercise 6. Show for the Fermi function
$f(x)$ as well as for the hyperbolic tangent
$\tanh(x)$ that their derivatives can be
expressed by the respective functions
themselves, so that the two statements

1. $f'(x) = f(x) \cdot (1 - f(x))$ and

2. $\tanh'(x) = 1 - \tanh^2(x)$

are true.
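
Before attempting the proof, the two identities of exercise 6 can be sanity-checked numerically. The sketch below compares a central difference quotient with the claimed expressions at a few sample points; it assumes the Fermi function without temperature parameter, f(x) = 1 / (1 + e^(-x)), and it is of course no substitute for the derivation asked for in the exercise.

```python
import math

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x, h=1e-6):
    # central difference quotient
    return (f(x + h) - f(x - h)) / (2.0 * h)

def max_deviation(points):
    worst = 0.0
    for x in points:
        dev_fermi = abs(numeric_derivative(fermi, x)
                        - fermi(x) * (1.0 - fermi(x)))
        dev_tanh = abs(numeric_derivative(math.tanh, x)
                       - (1.0 - math.tanh(x) ** 2))
        worst = max(worst, dev_fermi, dev_tanh)
    return worst

deviation = max_deviation([-2.0, -0.5, 0.0, 1.0, 3.0])
```

The deviation stays at the level of numerical noise, which is consistent with both statements.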
Chapter 4

Fundamentals on learning and training 
samples 


Approaches and thoughts of how to teach machines. Should neural networks 
be corrected? Should they only be encouraged? Or should they even learn 
without any help? Thoughts about what we want to change during the 
learning procedure and how we will change it, about the measurement of 

errors and when we have learned enough. 




As written above, the most interesting 
characteristic of neural networks is their 
capability to familiarize with problems 
by means of training and, after sufficient 
training, to be able to solve unknown prob¬ 
lems of the same class. This approach is re¬ 
ferred to as generalization. Before intro¬ 
ducing specific learning procedures, I want 
to propose some basic principles about the 
learning procedure in this chapter. 

4.1 There are different 
paradigms of learning 

Learning is a comprehensive term. A 
learning system changes itself in order to 
adapt to e.g. environmental changes. A 
neural network could learn from many 
things but, of course, there will always be 


the question of how to implement it. In 
principle, a neural network changes when 
its components are changing, as we have 
learned above. Theoretically, a neural net¬ 
work could learn by 

1. developing new connections, 

2. deleting existing connections, 

3. changing connecting weights, 

4. changing the threshold values of neu¬ 
rons, 

5. varying one or more of the three neu¬ 
ron functions (remember: activation 
function, propagation function and 
output function), 

6. developing new neurons, or

7. deleting existing neurons (and so, of
course, existing connections).

As mentioned above, we assume the
change in weight to be the most common 
procedure. Furthermore, deletion of con¬ 
nections can be realized by additionally 
taking care that a connection is no longer 
trained when it is set to 0. Moreover, we 
can develop further connections by setting 
a non-existing connection (with the value 
0 in the connection matrix) to a value
different from 0. As for the modification of
threshold values I refer to the possibility
of implementing them as weights (section
3.4). Thus, we perform any of the first four
of the learning paradigms by just training
synaptic weights.

The change of neuron functions is difficult 
to implement, not very intuitive and not 
exactly biologically motivated. Therefore 
it is not very popular and I will omit this 
topic here. The possibilities to develop or 
delete neurons do not only provide well 
adjusted weights during the training of a 
neural network, but also optimize the net¬ 
work topology. Thus, they attract a grow¬ 
ing interest and are often realized by using 
evolutionary procedures. But, since we ac¬ 
cept that a large part of learning possibil¬ 
ities can already be covered by changes in 
weight, they are also not the subject mat¬ 
ter of this text (however, it is planned to 
extend the text towards those aspects of 
training). 

SNIPE: Methods of the class 

NeuralNetwork allow for changes in 
connection weights, and addition and 
removal of both connections and neurons. 
Methods in NeuralNetworkDescriptor 
enable the change of neuron behaviors, 
respectively activation functions per 
layer. 


Thus, we let our neural network learn by 
modifying the connecting weights accord¬ 
ing to rules that can be formulated as al¬ 
gorithms. Therefore a learning procedure 
is always an algorithm that can easily be 
implemented by means of a programming 
language. Later in the text I will assume 
that the definition of the term desired out¬ 
put which is worth learning is known (and 
I will define formally what a training pat¬ 
tern is) and that we have a training set 
of learning samples. Let a training set be 
defined as follows: 

Definition 4.1 (Training set). A
training set (named P) is a set of training
patterns, which we use to train our
neural net.

I will now introduce the three essential
paradigms of learning by presenting the
differences between their respective
training sets.

4.1.1 Unsupervised learning
provides input patterns to the
network, but no learning aids

Unsupervised learning is the biologi¬ 
cally most plausible method, but is not 
suitable for all problems. Only the in¬ 
put patterns are given; the network tries 
to identify similar patterns and to classify 
them into similar categories. 

Definition 4.2 (Unsupervised learning). 

The training set only consists of input 
patterns, the network tries by itself to de¬ 
tect similarities and to generate pattern 
classes. 




Here I want to refer again to the popular
example of Kohonen's self-organising
maps (chapter 10).

4.1.2 Reinforcement learning 
methods provide feedback to 
the network, whether it 
behaves well or bad 

In reinforcement learning the network
receives a logical or a real value after
completion of a sequence, which defines
whether the result is right or wrong.
Intuitively it is clear that this procedure should
be more effective than unsupervised
learning since the network receives specific
criteria for problem-solving.

Definition 4.3 (Reinforcement learning).
The training set consists of input patterns.
After completion of a sequence a value is
returned to the network indicating whether
the result was right or wrong and, possibly,
how right or wrong it was.

4.1.3 Supervised learning methods 
provide training patterns 
together with appropriate 
desired outputs 

In supervised learning the training set
consists of input patterns as well as their
correct results in the form of the precise
activation of all output neurons. Thus, for
each training pattern that is fed into the
network the output, for instance, can directly
be compared with the correct solution
and the network weights can be changed
according to their difference. The
objective is to change the weights to the effect
that the network can not only associate
input and output patterns independently
after the training, but can also provide
plausible results to unknown, similar input
patterns, i.e. it generalises.

Definition 4.4 (Supervised learning).
The training set consists of input patterns
with correct results so that a precise
error vector¹ can be returned to the
network.

This learning procedure is not always bio¬ 
logically plausible, but it is extremely ef¬ 
fective and therefore very practicable. 

At first we want to look at the
supervised learning procedures in general,
which - in this text - correspond
to the following steps:

Entering the input pattern (activation of
input neurons),

Forward propagation of the input by the
network, generation of the output,

Comparing the output with the desired
output (teaching input), which provides
the error vector (difference vector),

Corrections of the network are
calculated based on the error vector,

Corrections are applied.

¹ The term error vector will be defined in section
4.2, where mathematical formalisation of learning
is discussed.
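
The five steps of the supervised learning scheme can be sketched as a short loop. Note the assumptions: the text has not introduced a concrete learning rule at this point, so the correction used here (a small step proportional to error times input, on a network with linear output neurons and no hidden layer) is swapped in purely for illustration. Unlike the Hinton diagram, W is indexed here as W[output, input]; the names `supervised_step` and `eta` are hypothetical.

```python
import numpy as np

def supervised_step(W, p, t, eta=0.1):
    # Steps 1 and 2: enter the input pattern and propagate it forward
    # (assumption: linear output neurons, no hidden layer).
    y = W @ p
    # Step 3: compare output with teaching input -> error vector
    E_p = t - y
    # Step 4: calculate corrections from the error vector
    # (illustrative rule, not the method of the text)
    delta_W = eta * np.outer(E_p, p)
    # Step 5: apply corrections
    return W + delta_W, E_p

W = np.zeros((1, 2))            # one output neuron, two input neurons
p = np.array([1.0, 2.0])        # training pattern
t = np.array([1.0])             # teaching input

errors = []
for _ in range(50):
    W, E_p = supervised_step(W, p, t)
    errors.append(float(np.abs(E_p).sum()))
```

With these values the error shrinks from one presentation to the next, because every correction moves the output a fixed fraction of the way towards the teaching input.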


4.1.4 Offline or online learning?


It must be noted that learning can be
offline (a set of training samples is
presented, then the weights are changed; the
total error is calculated by means of an error
function or simply accumulated -
see also section 4.4) or online (the weights
are changed after every sample presented).
Both procedures have advantages and
disadvantages, which will be discussed in the
learning procedures section if necessary.
Offline training procedures are also called
batch training procedures since a batch
of results is corrected all at once. Such a
training session over a whole batch of
training samples, including the related change
in weight values, is called an epoch.


Definition 4.5 (Offline learning). Sev¬ 
eral training patterns are entered into the 
network at once, the errors are accumu¬ 
lated and it learns for all patterns at the 
same time. 


Definition 4.6 (Online learning). The 
network learns directly from the errors of 
each training sample. 
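
The two definitions differ only in when the weight change is applied. The sketch below makes this explicit; as before, the linear single-layer network and the correction rule (error times input, scaled by a hypothetical learning rate `eta`) are assumptions chosen for illustration and are not prescribed by the text.

```python
import numpy as np

def online_epoch(W, P, eta=0.1):
    # Online: the weights are changed after every single sample.
    for p, t in P:
        W = W + eta * np.outer(t - W @ p, p)
    return W

def offline_epoch(W, P, eta=0.1):
    # Offline (batch): corrections are accumulated over all samples
    # using the unchanged weights and applied once at the end.
    total = np.zeros_like(W)
    for p, t in P:
        total += eta * np.outer(t - W @ p, p)
    return W + total

# Training set P of (pattern, teaching input) pairs, made up so that
# the weights (1, -1) reproduce every teaching input exactly.
P = [(np.array([1.0, 0.0]), np.array([1.0])),
     (np.array([0.0, 1.0]), np.array([-1.0])),
     (np.array([2.0, 1.0]), np.array([1.0]))]

W_on_first = online_epoch(np.zeros((1, 2)), P)
W_off_first = offline_epoch(np.zeros((1, 2)), P)

W_on, W_off = np.zeros((1, 2)), np.zeros((1, 2))
for _ in range(200):             # 200 epochs each
    W_on = online_epoch(W_on, P)
    W_off = offline_epoch(W_off, P)
```

After a single epoch the two procedures generally arrive at different weights, since the online variant already uses the updated weights for the second and third sample. On this small, consistent training set both still end up at the same solution.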


4.1.5 Questions you should answer
before learning

The application of such schemes certainly
requires preliminary thoughts about some
questions, which I want to introduce now
as a check list and, if possible, answer
them in the course of this text:

> Where does the learning input come
from and in what form?

> How must the weights be modified to
allow fast and reliable learning?

> How can the success of a learning
process be measured in an objective way?

> Is it possible to determine the "best"
learning procedure?

> Is it possible to predict if a learning
procedure terminates, i.e. whether it
will reach an optimal state after a
finite time or if it, for example, will
oscillate between different states?

> How can the learned patterns be
stored in the network?

> Is it possible to avoid that newly
learned patterns destroy previously
learned associations (the so-called
stability/plasticity dilemma)?

We will see that these questions cannot
be answered in general but that they have
to be discussed for each learning procedure
and each network topology individually.


4.2 Training patterns and
teaching input


Before we get to know our first learning
rule, we need to introduce the teaching
input. In (this) case of supervised
learning we assume a training set consisting
of training patterns and the
corresponding correct output values we want to see
at the output neurons after the training.
While the network has not finished
training, i.e. as long as it is generating wrong
outputs, these output values are referred
to as teaching input, and that for each
neuron individually. Thus, for a neuron j with
the incorrect output $o_j$, $t_j$ is the teaching
input, which means it is the correct or
desired output for a training pattern p.

Definition 4.7 (Training patterns). A
training pattern is an input vector p
with the components $p_1, p_2, \ldots, p_n$ whose
desired output is known. By entering the
training pattern into the network we
receive an output that can be compared with
the teaching input, which is the desired
output. The set of training patterns is
called P. It contains a finite number of
ordered pairs (p, t) of training patterns with
corresponding desired output.

Training patterns are often simply called 
patterns, that is why they are referred 
to as p. In the literature as well as in 
this text they are called synonymously pat¬ 
terns, training samples etc. 

Definition 4.8 (Teaching input). Let j
be an output neuron. The teaching
input $t_j$ is the desired and correct value j
should output after the input of a certain
training pattern. Analogously to the
vector p the teaching inputs $t_1, t_2, \ldots, t_n$ of
the neurons can also be combined into a
vector t. t always refers to a specific
training pattern p and is, as already mentioned,
contained in the set P of the training
patterns.

SNIPE: Classes that are relevant
for training data are located in
the package training. The class
TrainingSampleLesson allows for storage
of training patterns and teaching inputs,
as well as simple preprocessing of the
training data.

Definition 4.9 (Error vector). For
several output neurons $\Omega_1, \Omega_2, \ldots, \Omega_n$ the
difference between output vector and
teaching input under a training input p

$$E_p = \begin{pmatrix} t_1 - y_1 \\ \vdots \\ t_n - y_n \end{pmatrix}$$

is referred to as error vector, sometimes
it is also called difference vector.
Depending on whether you are learning
offline or online, the difference vector refers
to a specific training pattern, or to the
error of a set of training patterns which is
normalized in a certain way.

Now I want to briefly summarize the
vectors we have defined so far. There is the

input vector x, which can be entered into
the neural network. Depending on
the type of network being used the
neural network will output an

output vector y. Basically, the

training sample p is nothing more than
an input vector. We only use it for
training purposes because we know
the corresponding

teaching input t, which is nothing more
than the desired output vector to the
training sample. The

error vector $E_p$ is the difference between
the teaching input t and the actual
output y.


So, what x and y are for the general
network operation are p and t for the network
training - and during training we try to
bring y as close to t as possible. One piece
of advice concerning notation: We referred to
the output values of a neuron i as $o_i$. Thus,
the output of an output neuron Ω is called
$o_\Omega$. But the output values of a network are
referred to as $y_\Omega$. Certainly, these network
outputs are only neuron outputs, too, but
they are outputs of output neurons. In
this respect

$y_\Omega = o_\Omega$

is true.


4.3 Using training samples 


We have seen how we can learn in
principle and which steps are required to do
so. Now we should take a look at the
selection of training data and the learning
curve. After successful learning it is
particularly interesting whether the network
has only memorized - i.e. whether it can
reproduce the right output quite exactly
for our training samples but provides
wrong answers for all other problems of
the same class.


Suppose that we want the network to train
a mapping $\mathbb{R}^2 \to \mathbb{B}^1$ and therefore use the
training samples from fig. 4.1. Then there
could be a chance that, finally, the
network will exactly mark the colored areas
around the training samples with the
output 1 (fig. 4.1, top), and otherwise will
output 0. Thus, it has sufficient storage
capacity to concentrate on the six training
samples with the output 1. This implies
an oversized network with too much free
storage capacity.


Figure 4.1: Visualization of training results of
the same training set on networks with a capacity
being too high (top), correct (middle) or too low
(bottom).


On the other hand a network could have insufficient capacity (fig. 4.1, bottom) - this rough presentation of input data does not correspond to the good generalization performance we desire. Thus, we have to find the balance (fig. 4.1, middle).


4.3.1 It is useful to divide the set of 
training samples 

An often proposed solution for these problems is to divide the training set into

> one training set really used to train,

> and one verification set to test our progress

- provided that there are enough training samples. The usual division ratios are, for instance, 70% for training data and 30% for verification data (randomly chosen). We can finish the training when the network provides good results on the training data as well as on the verification data.
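
The random 70/30 division described above can be sketched in a few lines of Python. This is only an illustration and not the Snipe implementation: the function name `split_samples`, its parameters and the ten sample pairs are assumptions made for this example.

```python
import random

def split_samples(samples, train_ratio=0.7, seed=0):
    """Randomly split samples into a training set and a verification set."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    shuffled = list(samples)                 # copy, the caller's list stays untouched
    random.Random(seed).shuffle(shuffled)    # seeded, so the split is reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical (p, t) pairs: training sample p and teaching input t.
samples = [((x, x + 1), x % 2) for x in range(10)]
training_set, verification_set = split_samples(samples)
print(len(training_set), len(verification_set))
```

Fixing the seed is a design choice for repeatable experiments; with very few samples the verification set may become too small to be meaningful.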

SNIPE: The method splitLesson within the class TrainingSampleLesson allows for splitting a TrainingSampleLesson with respect to a given ratio.


But note: If the verification data provide poor results, do not keep modifying the network structure until these data provide good results - otherwise you run the risk of tailoring the network to the verification data.

This means that these data are included in the training, even if they are not used explicitly for the training. The solution is a third set of validation data used only for validation after a supposedly successful training.

By training fewer patterns, we obviously withhold information from the network and risk worsening the learning performance. But this text is not about 100% exact reproduction of given samples but about successful generalization and approximation of a whole function - for which it can definitely be useful to train less information into the network.


4.3.2 Order of pattern presentation

You can find different strategies to choose the order of pattern presentation: If patterns are presented in random sequence, there is no guarantee that the patterns are learned equally well (however, this is the standard method). Always using the same sequence of patterns, on the other hand, causes the patterns to be memorized when using recurrent networks (later, we will learn more about this type of network). A random permutation would solve both problems, but it is - as already mentioned - very time-consuming to calculate such a permutation.
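
To make the permutation strategy concrete, here is a minimal Python sketch that draws a fresh random permutation of the pattern indices for every epoch, so that each pattern is presented exactly once per epoch. The generator name `epoch_orders` and its parameters are hypothetical and not part of Snipe.

```python
import random

def epoch_orders(num_patterns, num_epochs, seed=0):
    """Yield one random permutation of the pattern indices per epoch."""
    rng = random.Random(seed)
    indices = list(range(num_patterns))
    for _ in range(num_epochs):
        rng.shuffle(indices)    # in-place shuffle of the index list
        yield list(indices)     # copy, so later shuffles do not alter it

orders = list(epoch_orders(num_patterns=5, num_epochs=3))
for epoch, order in enumerate(orders):
    print("epoch", epoch, "presentation order", order)
```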

SNIPE: The method shuffleSamples located in the class TrainingSampleLesson permutes a lesson.

4.4 Learning curve and error measurement

The learning curve indicates the progress of the error, which can be determined in various ways. The motivation to create a learning curve is that such a curve can indicate whether the network is progressing or not. For this, the error should be normalized, i.e. represent a distance measure between the correct and the current output of the network. For example, we can take the same pattern-specific, squared error with a prefactor, which we are also going to use to derive the backpropagation of error (let $\Omega$ be an output neuron and O the set of output neurons):

$$\mathrm{Err}_p = \frac{1}{2} \sum_{\Omega \in O} (t_\Omega - y_\Omega)^2 \qquad (4.1)$$

Definition 4.10 (Specific error). The specific error $\mathrm{Err}_p$ is based on a single training sample, which means it is generated online.

Additionally, the root mean square (abbreviated: RMS) and the Euclidean distance are often used.

The Euclidean distance (generalization of the theorem of Pythagoras) is useful for lower dimensions where we can still visualize its usefulness.

Definition 4.11 (Euclidean distance). The Euclidean distance between two vectors t and y is defined as

$$\mathrm{Err}_p = \sqrt{\sum_{\Omega \in O} (t_\Omega - y_\Omega)^2} \qquad (4.2)$$

Generally, the root mean square is commonly used since it considers extreme outliers to a greater extent.

Definition 4.12 (Root mean square). The root mean square of two vectors t and y is defined as

$$\mathrm{Err}_p = \sqrt{\frac{\sum_{\Omega \in O} (t_\Omega - y_\Omega)^2}{|O|}} \qquad (4.3)$$

As for offline learning, the total error in the course of one training epoch is interesting and useful, too:

$$\mathrm{Err} = \sum_{p \in P} \mathrm{Err}_p \qquad (4.4)$$

Definition 4.13 (Total error). The total error Err is based on all training samples, that means it is generated offline.

Analogously we can generate a total RMS and a total Euclidean distance in the course of a whole epoch. Of course, it is possible to use other types of error measurement. To get used to further error measurement methods, I suggest having a look into the technical report of Prechelt [Pre94]. In this report, both error measurement methods and sample problems are discussed (this is why there will be a similar suggestion during the discussion of exemplary problems).

SNIPE: There are several static methods representing different methods of error measurement implemented in the class ErrorMeasurement.
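
The error measures of this section (equations 4.1 to 4.4) translate almost directly into code. The following Python sketch is only an illustration with assumed function names; it is not the Snipe class ErrorMeasurement, and the vectors t and y are made-up values.

```python
import math

def specific_error(t, y):
    """Equation 4.1: half the sum of squared differences."""
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y))

def euclidean_distance(t, y):
    """Equation 4.2: square root of the sum of squared differences."""
    return math.sqrt(sum((ti - yi) ** 2 for ti, yi in zip(t, y)))

def root_mean_square(t, y):
    """Equation 4.3: like 4.2, but divided by the number of output neurons."""
    return math.sqrt(sum((ti - yi) ** 2 for ti, yi in zip(t, y)) / len(t))

def total_error(pairs, measure=specific_error):
    """Equation 4.4: sum of the specific errors over all (t, y) pairs."""
    return sum(measure(t, y) for t, y in pairs)

t = [1.0, 0.0]    # teaching input
y = [0.4, 0.8]    # hypothetical network output
print(specific_error(t, y), euclidean_distance(t, y), root_mean_square(t, y))
```

Passing a different `measure` to `total_error` gives the total Euclidean distance or total RMS over an epoch in the same way.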


Depending on our method of error measurement our learning curve certainly changes, too. A perfect learning curve looks like a negative exponential function, that means it is proportional to $e^{-t}$ (fig. 4.2 on the following page). Thus, the learning curve can be illustrated by means of a logarithmic scale (fig. 4.2, second diagram from the bottom) - with the said scaling combination a descending line implies an exponential descent of the error.


With the network doing a good job, the problems being not too difficult and the logarithmic representation of Err you can see - metaphorically speaking - a descending line that often forms "spikes" at the bottom - here, we reach the limit of the 64-bit resolution of our computer and our network has actually learned the optimum of what it is capable of learning.

Typical learning curves can show a few flat areas as well, i.e. they can show some steps, which is no sign of a malfunctioning learning process. As we can also see in fig. 4.2, a well-suited representation can make any slightly decreasing learning curve look good - so just be cautious when reading the literature.


4.4.1 When do we stop learning? 

Now, the big question is: When do we stop learning? Generally, the training is stopped when the user in front of the learning computer "thinks" the error is small enough. Indeed, there is no easy answer and thus I can once again only give you something to think about, which, however, depends on a more objective view on the comparison of several learning curves.

Confidence in the results, for example, is 
boosted, when the network always reaches 
nearly the same final error-rate for differ¬ 
ent random initializations - so repeated 
initialization and training will provide a 
more objective result. 

On the other hand, it can be possible that 
a curve descending fast in the beginning 
can, after a longer time of learning, be 
overtaken by another curve: This can indi¬ 
cate that either the learning rate of the 
worse curve was too high or the worse 
curve itself simply got stuck in a local min¬ 
imum, but was the first to find it. 

Remember: Larger error values are worse 
than the small ones. 

But, in any case, note: Many people only 
generate a learning curve in respect of the 
training data (and then they are surprised 
that only a few things will work) - but for 
reasons of objectivity and clarity it should 
not be forgotten to plot the verification 
data on a second learning curve, which 
generally provides values that are slightly 
worse and with stronger oscillation. But 
with good generalization the curve can de¬ 
crease, too. 

When the network eventually begins to memorize the samples, the shape of the learning curve can provide an indication: If the learning curve of the verification samples is suddenly and rapidly rising while the learning curve of the training data is continuously falling, this could indicate memorizing and a generalization getting poorer and poorer. At this point it could be decided whether the network had already learned well enough at the point where the two curves were closest, and maybe the final point of learning is to be applied there (this procedure is called early stopping).

Figure 4.2: All four illustrations show the same (idealized, because very smooth) learning curve, plotted over the epoch. Note the alternating logarithmic and linear scalings! Also note the small "inaccurate spikes" visible in the sharp bend of the curve in the first and second diagram from bottom.

Once again I want to remind you that all of these observations act as indicators and should not be used to draw If-Then conclusions.
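
One common way to turn the early stopping idea into code is sketched below in Python. It assumes that the verification error of every epoch has been recorded in a list. The `patience` parameter (how many epochs without improvement we tolerate) is an assumption of this sketch and does not appear in the text; like everything in this section, it is a heuristic indicator and not a rule.

```python
def early_stopping_epoch(verification_errors, patience=3):
    """Return the epoch with the lowest verification error seen before the
    error failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(verification_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break    # verification error keeps rising: stop looking
    return best_epoch

# Hypothetical verification curve: falls at first, then rises again.
curve = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
stop_at = early_stopping_epoch(curve)
print("use the weights from epoch", stop_at)
```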

4.5 Gradient optimization 
procedures 


In order to establish the mathematical basis for some of the following learning procedures I want to explain briefly what is meant by gradient descent: the backpropagation of error learning procedure, for example, involves this mathematical basis and thus inherits the advantages and disadvantages of the gradient descent.

Gradient descent procedures are generally used where we want to maximize or minimize n-dimensional functions. For the sake of clarity the illustration (fig. 4.3 on the next page) shows only two dimensions, but in principle there is no limit to the number of dimensions.

The gradient is a vector g that is defined for any differentiable point of a function, that points from this point exactly towards the steepest ascent and indicates the slope in this direction by means of its norm $|g|$. Thus, the gradient is a generalization of the derivative for multi-dimensional functions. Accordingly, the negative gradient $-g$ points exactly towards the steepest descent. The gradient operator $\nabla$ is referred to as nabla operator, the overall notation of the gradient g at the point (x, y) of a two-dimensional function f being $g(x, y) = \nabla f(x, y)$.

Definition 4.14 (Gradient). Let g be a gradient. Then g is a vector with n components that is defined for any point of a (differentiable) n-dimensional function $f(x_1, x_2, \ldots, x_n)$. The gradient operator notation is defined as

$$g(x_1, x_2, \ldots, x_n) = \nabla f(x_1, x_2, \ldots, x_n).$$

g directs from any point of f towards the steepest ascent from this point, with $|g|$ corresponding to the degree of this ascent.

Gradient descent means going downhill in small steps from any starting point of our function against the gradient g (which means, vividly speaking, in the direction in which a ball would roll from the starting point), with the size of the steps being proportional to $|g|$ (the steeper the descent, the longer the steps). Therefore, we move slowly on a flat plateau, and on a steep slope we run downhill rapidly. If we came into a valley, we would - depending on the size of our steps - jump over it or we would return into the valley across the opposite hillside in order to come closer and closer to the deepest point of the valley by walking back and forth, similar to our ball moving within a round bowl.

Figure 4.3: Visualization of the gradient descent on a two-dimensional error function. We move forward in the opposite direction of g, i.e. with the steepest descent towards the lowest point, with the step width being proportional to $|g|$ (the steeper the descent, the faster the steps). On the left the area is shown in 3D, on the right the steps over the contour lines are shown in 2D. Here it is obvious how a movement is made in the opposite direction of g towards the minimum of the function and continuously slows down proportionally to $|g|$. Source:
http://webster.fhs-hagenberg.ac.at/staff/sdreisei/Teaching/WS2001-2002/
PatternClassification/graddescent.pdf




Definition 4.15 (Gradient descent). Let f be an n-dimensional function and $s = (s_1, s_2, \ldots, s_n)$ the given starting point. Gradient descent means going from f(s) against the direction of g, i.e. towards $-g$ with steps of the size of $|g|$ towards smaller and smaller values of f.
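
Gradient descent as defined here can be tried out with a short Python sketch. Several things are assumptions of this example and not part of the definition: the gradient is approximated numerically by central differences, the proportionality factor between step and $|g|$ is a fixed `learning_rate`, and the function `bowl` is a made-up test function with its minimum at (1, -2).

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for k in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[k] += h
        backward[k] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

def gradient_descent(f, start, learning_rate=0.1, steps=200):
    """Walk from `start` against the gradient for a fixed number of steps."""
    point = list(start)
    for _ in range(steps):
        g = numerical_gradient(f, point)
        # Step towards -g; the step length is proportional to |g|.
        point = [x - learning_rate * gx for x, gx in zip(point, g)]
    return point

def bowl(p):
    """A round bowl with its lowest point at (1, -2)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

minimum = gradient_descent(bowl, start=[4.0, 3.0])
print(minimum)
```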

Gradient descent procedures are by no means an error-free optimization procedure (as we will see in the following sections) - however, they still work well on many problems, which makes them a frequently used optimization paradigm. Anyway, let us have a look at their potential disadvantages so we can keep them in mind a bit.


4.5.1 Gradient procedures incorporate several problems

As already implied in section 4.5, the gradient descent (and therefore the backpropagation) is promising but not foolproof. One problem is that the result does not always reveal if an error has occurred.


4.5.1.1 Often, gradient descents converge against suboptimal minima

Every gradient descent procedure can, for example, get stuck within a local minimum (part a of fig. 4.4 on the facing page). This problem increases proportionally to the size of the error surface, and there is no universal solution. In reality, one cannot know whether the optimal minimum has been reached, and a training is considered successful if an acceptable minimum is found.

Figure 4.4: Possible errors during a gradient descent: a) Detecting bad minima, b) Quasi-standstill with small gradient, c) Oscillation in canyons, d) Leaving good minima.
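
Getting stuck in a suboptimal minimum is easy to reproduce in one dimension. The Python sketch below uses a made-up function $f(x) = x^4 - 3x^2 + x$, chosen for this example because it has a deep minimum on the left and a shallower one on the right. Which of the two is found depends only on the starting point.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    """Derivative of f, i.e. the one-dimensional gradient."""
    return 4 * x ** 3 - 6 * x + 1

def descend(start, learning_rate=0.01, steps=1000):
    """Plain gradient descent on f from the given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

left = descend(-2.0)     # ends in the deeper minimum
right = descend(2.0)     # gets stuck in the shallower minimum
print(left, f(left))
print(right, f(right))
```

Neither run reports a problem: only the comparison of the two final function values reveals that the second descent found the worse minimum.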


4.5.1.2 Flat plateaus on the error surface may cause training slowness

When passing a flat plateau, for instance, the gradient also becomes negligibly small because there is hardly a descent (part b of fig. 4.4), which requires many further steps. A hypothetically possible gradient of 0 would completely stop the descent.


4.5.1.3 Even if good minima are 
reached, they may be left 
afterwards 


On the other hand the gradient is very 
large at a steep slope so that large steps 
can be made and a good minimum can pos¬ 
sibly be missed (part d of fig. 4.4). 
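
The step-size effects described in this section can be reproduced with one-dimensional toy functions. In the Python sketch below, the functions $x^2$ and $x^4$, the starting points and the learning rates are all assumptions chosen so that the numbers are easy to follow. On $x^2$ the update is $x \leftarrow (1 - 2\eta)x$, so the behaviour depends only on the learning rate $\eta$.

```python
def descend(gradient, start, learning_rate, steps):
    """Return the list of positions visited by plain gradient descent."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * gradient(x)    # step against the gradient
        path.append(x)
    return path

def steep(x):
    """Gradient of f(x) = x**2."""
    return 2 * x

def flat(x):
    """Gradient of f(x) = x**4, which is almost zero near x = 0."""
    return 4 * x ** 3

converging = descend(steep, 1.0, 0.25, 4)     # halves the distance each step
oscillating = descend(steep, 1.0, 1.0, 4)     # jumps between 1 and -1 forever
diverging = descend(steep, 1.0, 1.5, 4)       # leaves the minimum further each step
plateau = descend(flat, 0.1, 0.25, 10)        # barely moves: the gradient is tiny
print(converging, oscillating, diverging, plateau[-1], sep="\n")
```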


4.5.1.4 Steep canyons in the error surface may cause oscillations

A sudden alternation from one very strong negative gradient to a very strong positive one can even result in oscillation (part c of fig. 4.4). In practice, such an error does not occur very often, so that we can concentrate on the possibilities b and d.

4.6 Exemplary problems allow for testing self-coded learning strategies

We looked at learning from the formal point of view - not much yet but a little. Now it is time to look at a few exemplary problems you can later use to test implemented networks and learning rules.


4.6.1 Boolean functions 

A popular example is the one that did not work in the nineteen-sixties: the XOR function ($\mathbb{B}^2 \to \mathbb{B}^1$). We need a hidden neuron layer, which we have discussed in detail. Thus, we need at least two neurons in the inner layer. Let the activation function in all layers (except in the input layer, of course) be the hyperbolic tangent. Trivially, we now expect the outputs 1.0 or -1.0, depending on whether the function XOR outputs 1 or 0 - and exactly here is where the first beginner's mistake occurs.

For outputs close to 1 or -1, i.e. close to the limits of the hyperbolic tangent (or in case of the Fermi function 0 or 1), we need very large network inputs. The only chance to reach these network inputs are large weights, which have to be learned: The learning process is considerably prolonged. Therefore it is wiser to enter the teaching inputs 0.9 or -0.9 as desired outputs or to be satisfied when the network outputs those values instead of 1 and -1.
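
A quick calculation shows why teaching inputs of exactly 1 or -1 are a bad idea with the hyperbolic tangent. The Python sketch below inverts tanh to obtain the network input a neuron would need for a given output; the helper name `required_net_input` is an assumption of this example.

```python
import math

def required_net_input(target):
    """Network input for which tanh yields `target` (the inverse of tanh)."""
    return math.atanh(target)

for target in (0.9, 0.99, 0.999, 0.999999):
    print(f"tanh(net) = {target}: net = {required_net_input(target):.3f}")
```

The required network input grows without bound as the target approaches 1, whereas a target of 0.9 needs a network input of only about 1.47.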


 i1  i2  i3 |  Ω
  0   0   0 |  1
  0   0   1 |  0
  0   1   0 |  0
  0   1   1 |  1
  1   0   0 |  0
  1   0   1 |  1
  1   1   0 |  1
  1   1   1 |  0

Table 4.1: Illustration of the parity function with three inputs.


Other favourite examples for singlelayer perceptrons are the Boolean functions AND and OR.


4.6.2 The parity function 


The parity function maps a set of bits to 1 or 0, depending on whether an even number of input bits is set to 1 or not. Basically, this is the function $\mathbb{B}^n \to \mathbb{B}^1$. It is characterized by easy learnability up to approx. n = 3 (shown in table 4.1), but the learning effort rapidly increases from n = 4. The reader may create a score table for the 2-bit parity function. What is conspicuous?
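
Training samples for the parity function can be generated for any number of inputs. The following Python sketch follows the definition given here (output 1 for an even number of set bits); the function names are assumptions of this example.

```python
from itertools import product

def parity(bits):
    """Return 1 if an even number of bits is set to 1, otherwise 0."""
    return 1 if sum(bits) % 2 == 0 else 0

def parity_table(n):
    """All 2**n input combinations together with their parity output."""
    return [(bits, parity(bits)) for bits in product((0, 1), repeat=n)]

for bits, output in parity_table(3):
    print(*bits, "->", output)
```

Calling `parity_table(2)` produces the score table asked for in the text.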


4.6.3 The 2-spiral problem

As a training sample for a function let us take two spirals coiled into each other (fig. 4.5 on the facing page) with the function certainly representing a mapping $\mathbb{R}^2 \to \mathbb{B}^1$. One of the spirals is assigned to the output value 1, the other spiral to 0. Here, memorizing does not help. The network has to understand the mapping itself. This example can be solved by means of an MLP, too.

Figure 4.5: Illustration of the training samples of the 2-spiral problem

Figure 4.6: Illustration of training samples for the checkerboard problem

4.6.4 The checkerboard problem

We again create a two-dimensional function of the form $\mathbb{R}^2 \to \mathbb{B}^1$ and specify checkered training samples (fig. 4.6) with one colored field representing 1 and all the rest of them representing 0. The difficulty increases proportionally to the size of the function: While a 3 x 3 field is easy to learn, the larger fields are more difficult (here we eventually use methods that are more suitable for this kind of problem than the MLP).

The 2-spiral problem is very similar to the checkerboard problem, only that, mathematically speaking, the first problem is using polar coordinates instead of Cartesian coordinates. I just want to introduce as an example one last trivial case: the identity.
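
Training samples for the checkerboard and the 2-spiral problem can be generated with a few lines of Python. The sketch below is one possible construction under assumptions of my own: the checkerboard labels alternate between 0 and 1 from field to field, the second spiral is the first one rotated by 180 degrees, and all names and default values are made up for this example.

```python
import math
import random

def checkerboard_samples(size=3, count=200, seed=0):
    """Random points on a size x size board, labelled by the field colour."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        x, y = rng.uniform(0, size), rng.uniform(0, size)
        label = (int(x) + int(y)) % 2    # Cartesian fields alternate 0 and 1
        samples.append(((x, y), label))
    return samples

def two_spiral_samples(points_per_spiral=50, turns=2.0):
    """Points on two interleaved spirals; the radius grows with the angle."""
    samples = []
    for k in range(points_per_spiral):
        angle = turns * 2 * math.pi * k / points_per_spiral
        radius = (k + 1) / points_per_spiral
        x, y = radius * math.cos(angle), radius * math.sin(angle)
        samples.append(((x, y), 1))      # first spiral
        samples.append(((-x, -y), 0))    # second spiral, rotated by 180 degrees
    return samples

print(checkerboard_samples(count=3))
print(two_spiral_samples(points_per_spiral=2))
```

The spiral points are generated in polar coordinates and the checkerboard points in Cartesian coordinates, which mirrors the remark in the text about the similarity of the two problems.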


4.6.5 The identity function

By using linear activation functions the identity mapping from $\mathbb{R}^1$ to $\mathbb{R}^1$ (of course only within the parameters of the used activation function) is no problem for the network, but we put some obstacles in its way by using our sigmoid functions so that it would be difficult for the network to learn the identity. Just try it for the fun of it.

4.6.6 There are lots of other exemplary problems

For lots and lots of further exemplary problems, I want to recommend the technical report written by Prechelt [Pre94] which also has been named in the sections about error measurement procedures.

Now, it is time to have a look at our first mathematical learning rule.

4.7 The Hebbian learning rule is the basis for most other learning rules

In 1949, Donald O. Hebb formulated the Hebbian rule [Heb49] which is the basis for most of the more complicated learning rules we will discuss in this text. We distinguish between the original form and the more general form, which is a kind of principle for other learning rules.

4.7.1 Original rule

Definition 4.16 (Hebbian rule). "If neuron j receives an input from neuron i and if both neurons are strongly active at the same time, then increase the weight $w_{i,j}$ (i.e. the strength of the connection between i and j)." Mathematically speaking, the rule is:

$$\Delta w_{i,j} \sim \eta \, o_i \, a_j \qquad (4.5)$$

with $\Delta w_{i,j}$ being the change in weight from i to j, which is proportional to the following factors:

> the output $o_i$ of the predecessor neuron i, as well as

> the activation $a_j$ of the successor neuron j,

> a constant $\eta$, i.e. the learning rate, which will be discussed in a later section.

The changes in weight $\Delta w_{i,j}$ are simply added to the weight $w_{i,j}$.

Why am I speaking twice about activation, but in the formula I am using $o_i$ and $a_j$, i.e. the output of neuron i and the activation of neuron j? Remember that the identity is often used as output function and therefore $a_i$ and $o_i$ of a neuron are often the same. Besides, Hebb postulated his rule long before the specification of technical neurons. Considering that this learning rule was preferred in binary activations, it is clear that with the possible activations (1,0) the weights will either increase or remain constant. Sooner or later they would go ad infinitum, since they can only be corrected "upwards" when an error occurs. This can be compensated by using the activations (-1,1) (see footnote 2). Thus, the weights are decreased when the activation of the predecessor neuron dissents from the one of the successor neuron, otherwise they are increased.
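
The difference between the activations (1,0) and (-1,1) can be checked with a small Python sketch of the original rule (equation 4.5). The function names, the learning rate of 0.1 and the way all four activation combinations are presented equally often are assumptions of this example.

```python
def hebbian_update(weight, o_i, a_j, learning_rate=0.1):
    """Original Hebbian rule: add eta * o_i * a_j to the weight."""
    return weight + learning_rate * o_i * a_j

def train(pairs, weight=0.0, learning_rate=0.1):
    """Apply the Hebbian update once for every (o_i, a_j) pair."""
    for o_i, a_j in pairs:
        weight = hebbian_update(weight, o_i, a_j, learning_rate)
    return weight

binary = [(1, 1), (1, 0), (0, 1), (0, 0)]        # activations (1,0)
bipolar = [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # activations (-1,1)

print(train(binary * 10))     # can only grow or stay constant
print(train(bipolar * 10))    # increases and decreases cancel out here
```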

2 But that is no longer the "original version" of the Hebbian rule.

4.7.2 Generalized form

Most of the learning rules discussed before are a specialization of the mathematically more general form [MR86] of the Hebbian rule.

Definition 4.17 (Hebbian rule, more general). The generalized form of the Hebbian Rule only specifies the proportionality of the change in weight to the product of two undefined functions, but with defined input values.

$$\Delta w_{i,j} = \eta \cdot h(o_i, w_{i,j}) \cdot g(a_j, t_j) \qquad (4.6)$$

Thus, the product of the functions

> $g(a_j, t_j)$ and

> $h(o_i, w_{i,j})$

> as well as the constant learning rate $\eta$

results in the change in weight. As you can see, h receives the output of the predecessor cell $o_i$ as well as the weight from predecessor to successor $w_{i,j}$ while g expects the actual and desired activation of the successor $a_j$ and $t_j$ (here t stands for the aforementioned teaching input). As already mentioned g and h are not specified in this general definition. Therefore, we will now return to the path of specialization we discussed before equation 4.6. After we have had a short picture of what a learning rule could look like and of our thoughts about learning itself, we will be introduced to our first network paradigm including the learning procedure.

Exercises

Exercise 7. Calculate the average value $\mu$ and the standard deviation $\sigma$ for the following data points.

p1 = (2, 2, 2)
p2 = (3, 3, 3)
p3 = (4, 4, 4)
p4 = (6, 0, 0)
p5 = (0, 6, 0)
p6 = (0, 0, 6)

Chapter 5

The perceptron, backpropagation and its variants

A classic among the neural networks. If we talk about a neural network, then in the majority of cases we speak about a perceptron or a variation of it. Perceptrons are multilayer networks without recurrence and with fixed input and output layers. Description of a perceptron, its limits and extensions that should avoid the limitations. Derivation of learning procedures and discussion of their problems.


As already mentioned in the history of neural networks, the perceptron was described by Frank Rosenblatt in 1958 [Ros58]. Initially, Rosenblatt defined the already discussed weighted sum and a non-linear activation function as components of the perceptron.

There is no established definition for a perceptron, but most of the time the term is used to describe a feedforward network with shortcut connections. This network has a layer of scanner neurons (retina) with statically weighted connections to the following layer, which is called the input layer (fig. 5.1 on the next page); but the weights of all other layers are allowed to be changed. All neurons subordinate to the retina are pattern detectors. Here we initially use a binary perceptron with every output neuron having exactly two possible output values (e.g. {0,1} or {-1,1}). Thus, a binary threshold function is used as activation function, depending on the threshold value $\Theta$ of the output neuron.

In a way, the binary activation function 
represents an IF query which can also 
be negated by means of negative weights. 
The perceptron can thus be used to ac¬ 
complish true logical information process¬ 
ing. 

Whether this method is reasonable is another matter - of course, this is not the easiest way to achieve Boolean logic. I just want to illustrate that perceptrons can be used as simple logical components and that, theoretically speaking, any Boolean function can be realized by means of perceptrons being connected in series or interconnected in a sophisticated way. But we will see that this is not possible without connecting them serially. Before providing the definition of the perceptron, I want to define some types of neurons used in this chapter.

Figure 5.1: Architecture of a perceptron with one layer of variable connections in different views. The solid-drawn weight layer in the two illustrations on the bottom can be trained.
Left side: Example of scanning information in the eye.
Right side, upper part: Drawing of the same example with indicated fixed-weight layer using the defined designs of the functional descriptions for neurons.
Right side, lower part: Without indicated fixed-weight layer, with the name of each neuron corresponding to our convention. The fixed-weight layer will no longer be taken into account in the course of this work.

Definition 5.1 (Input neuron). An input neuron is an identity neuron. It exactly forwards the information received. Thus, it represents the identity function, which should be indicated by the symbol /. Therefore the input neuron is represented by this symbol drawn inside a circle.

Definition 5.2 (Information processing neuron). Information processing neurons somehow process the input information, i.e. do not represent the identity function. A binary neuron sums up all inputs by using the weighted sum as propagation function, which we want to illustrate by the sign $\Sigma$. Then the activation function of the neuron is the binary threshold function, which can be illustrated by a step symbol. This leads us to the complete depiction of information processing neurons, namely a circle containing both symbols (neuron drawing not reproduced here). Other neurons that use the weighted sum as propagation function but the activation functions hyperbolic tangent or Fermi function, or with a separately defined activation function $f_{act}$, are similarly represented (neuron drawings not reproduced here). These neurons are also referred to as Fermi neurons or Tanh neurons.

Now that we know the components of a perceptron we should be able to define it.

Definition 5.3 (Perceptron). The perceptron (fig. 5.1 on the facing page) is (see footnote 1) a feedforward network containing a retina that is used only for data acquisition and which has fixed-weighted connections with the first neuron layer (input layer). The fixed-weight layer is followed by at least one trainable weight layer. One neuron layer is completely linked with the following layer. The first layer of the perceptron consists of the input neurons defined above.

A feedforward network often contains shortcuts, which does not exactly correspond to the original description and therefore is not included in the definition. We can see that the retina is not included in the lower part of fig. 5.1. As a matter of fact the first neuron layer is often understood (simplified and sufficient for this method) as input layer, because this layer only forwards the input values. The retina itself and the static weights behind it are no longer mentioned or displayed, since they do not process information in any case. So, the depiction of a perceptron starts with the input neurons.

1 It may confuse some readers that I claim that there is no definition of a perceptron but then define the perceptron in the following section. I therefore suggest keeping my definition in the back of your mind and just taking it for granted in the course of this work.


SNIPE: The methods 

setSettingsTopologyFeedForward 
and the variation -WithShortcuts in 
a NeuralNetworkDescriptor-Instance 
apply settings to a descriptor, which 
are appropriate for feedforward networks 
or feedforward networks with shortcuts. 
The respective kinds of connections are 
allowed, all others are not, and fastprop is 
activated. 





5.1 The singlelayer perceptron provides only one trainable weight layer

Here, connections with trainable weights go from the input layer to an output neuron $\Omega$, which returns the information whether the pattern entered at the input neurons was recognized or not. Thus, a singlelayer perceptron (abbreviated SLP) has only one level of trainable weights (fig. 5.1 on page 72).


Definition 5.4 (Singlelayer perceptron). A singlelayer perceptron (SLP) is a perceptron having only one layer of variable weights and one layer of output neurons $\Omega$. The technical view of an SLP is shown in fig. 5.2.


Certainly, the existence of several output neurons $\Omega_1, \Omega_2, \ldots, \Omega_n$ does not considerably change the concept of the perceptron (fig. 5.3): A perceptron with several output neurons can also be regarded as several different perceptrons with the same input.


Figure 5.2: A singlelayer perceptron with two input neurons and one output neuron. The network returns the output by means of the arrow leaving the network. The trainable layer of weights is situated in the center (labeled). As a reminder, the bias neuron is again included here. Although the weight $w_{BIAS,\Omega}$ is a normal weight and also treated like this, I have represented it by a dotted line - which significantly increases the clarity of larger networks. In future, the bias neuron will no longer be included.

Figure 5.3: Singlelayer perceptron with several output neurons



Figure 5.4: Two singlelayer perceptrons for 
Boolean functions. The upper singlelayer per¬ 
ceptron realizes an AND, the lower one realizes 
an OR. The activation function of the informa¬ 
tion processing neuron is the binary threshold 
function. Where available, the threshold values 
are written into the neurons. 
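
Singlelayer perceptrons for Boolean functions can be written down directly as binary threshold neurons. In the Python sketch below, the weights and threshold values (weights of 1 with thresholds 1.5 for AND and 0.5 for OR, and a negative weight for NOT) are assumptions chosen for this example; the values in the figure may differ.

```python
def binary_threshold_neuron(inputs, weights, threshold):
    """Weighted sum as propagation function, binary threshold as activation."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0

def logical_and(x1, x2):
    return binary_threshold_neuron((x1, x2), (1.0, 1.0), 1.5)

def logical_or(x1, x2):
    return binary_threshold_neuron((x1, x2), (1.0, 1.0), 0.5)

def logical_not(x):
    # A negative weight negates the "IF query" of the neuron.
    return binary_threshold_neuron((x,), (-1.0,), -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```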


The Boolean functions AND and OR shown in fig. 5.4 are trivial examples
that can easily be composed.

Now we want to know how to train a singlelayer perceptron. We will
therefore first take a look at the perceptron learning algorithm and
then at the delta rule.


5.1.1 Perceptron learning algorithm 
and convergence theorem 

The original perceptron learning algorithm with binary neuron activation
function is described in alg. 1. It has been proven that the algorithm
converges in finite time, so in finite time the perceptron can learn
anything it can represent (perceptron convergence theorem, [Ros62]). But
please do not get your hopes up too soon! What the perceptron is capable
of representing will be explored later.

During the exploration of linear separability of problems we will cover
the fact that at least the singlelayer perceptron unfortunately cannot
represent a lot of problems.


5.1.2 The delta rule as a gradient 
based learning strategy for 
SLPs 

In the following we deviate from our binary threshold value as
activation function because at least for backpropagation of error we
need, as you will see, a differentiable or even a semi-linear activation
function. For the now following delta rule (like backpropagation derived
in [MR86]) it is not always necessary but useful. This fact, however,
will also be pointed out in the appropriate part of this work. Compared
with the aforementioned perceptron learning algorithm, the delta rule
has the advantage of being suitable for non-binary activation functions
and, when far away from the learning target, of automatically learning
faster.


while $\exists p \in P$ and error too large do
  Input $p$ into the network, calculate output $y$ {$P$ set of training patterns}
  for all output neurons $\Omega$ do
    if $y_\Omega = t_\Omega$ then
      Output is okay, no correction of weights
    else
      if $y_\Omega = 0$ then
        for all input neurons $i$ do
          $w_{i,\Omega} := w_{i,\Omega} + o_i$ {...increase weight towards $\Omega$ by $o_i$}
        end for
      end if
      if $y_\Omega = 1$ then
        for all input neurons $i$ do
          $w_{i,\Omega} := w_{i,\Omega} - o_i$ {...decrease weight towards $\Omega$ by $o_i$}
        end for
      end if
    end if
  end for
end while

Algorithm 1: Perceptron learning algorithm. The perceptron learning
algorithm reduces the weights to output neurons that return 1 instead of
0, and in the inverse case increases weights.
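Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under some assumptions that are not part of the text: there is a single output neuron, the bias is modelled as an extra weight on a constant input of 1, the threshold is fixed at 0, and "error too large" is replaced by "at least one pattern is still misclassified" with a hypothetical epoch limit `max_epochs`. All function names are made up for this example.

```python
def threshold(net, theta=0.0):
    """Binary threshold activation: fire (1) when net reaches theta."""
    return 1 if net >= theta else 0


def train_perceptron(patterns, n_inputs, max_epochs=100):
    """Perceptron learning for one output neuron.

    patterns: list of (inputs, target) pairs with binary targets.
    A constant 1 is appended to every input so the last weight acts as bias.
    """
    weights = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in patterns:
            o = list(inputs) + [1.0]  # outputs o_i of the input neurons plus bias
            y = threshold(sum(w * x for w, x in zip(weights, o)))
            if y == target:
                continue  # output is okay, no correction of weights
            errors += 1
            sign = 1.0 if y == 0 else -1.0  # y = 0: increase, y = 1: decrease
            weights = [w + sign * x for w, x in zip(weights, o)]
        if errors == 0:
            break  # every pattern is classified correctly
    return weights


def predict(weights, inputs):
    """Output of the trained threshold neuron for one input vector."""
    o = list(inputs) + [1.0]
    return threshold(sum(w * x for w, x in zip(weights, o)))


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w_and = train_perceptron(AND, 2)
w_or = train_perceptron(OR, 2)
```

Both AND and OR are learned after a handful of epochs, which is what the convergence theorem promises for functions the perceptron can represent.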

Suppose that we have a singlelayer perceptron with randomly set weights
which we want to teach a function by means of training samples. The set
of these training samples is called $P$. It contains, as already
defined, the pairs $(p, t)$ of the training samples $p$ and the
associated teaching input $t$. I also want to remind you that

> $x$ is the input vector and

> $y$ is the output vector of a neural network,

> output neurons are referred to as $\Omega_1, \Omega_2, \ldots, \Omega_{|O|}$,

> $i$ is the input and

> $o$ is the output of a neuron.

Additionally, we defined that

> the error vector $E_p$ represents the difference $(t - y)$ under a
certain training sample $p$.

> Furthermore, let $O$ be the set of output neurons and

> $I$ be the set of input neurons.

Another naming convention shall be that, for example, for an output $o$
and a teaching input $t$ an additional index $p$ may be set in order to
indicate that these values are pattern-specific. Sometimes this will
considerably enhance clarity.

Now our learning target will certainly be that for all training samples
the output $y$ of the network is approximately the desired output $t$,
i.e. formally it is true that

$\forall p: y \approx t$ or $\forall p: E_p \approx 0$.

This means we first have to understand the total error Err as a function
of the weights: The total error increases or decreases depending on how
we change the weights.

Definition 5.5 (Error function). The error function

$\text{Err}: W \to \mathbb{R}$

regards the set² of weights $W$ as a vector and maps the values onto the
normalized output error (normalized because otherwise not all errors can
be mapped onto one single $e \in \mathbb{R}$ to perform a gradient
descent). It is obvious that a specific error function
$\text{Err}_p(W)$ can analogously be generated for a single pattern $p$.

As already shown in section 4.5, gradient descent procedures calculate
the gradient of an arbitrary but finite-dimensional function (here: of
the error function $\text{Err}(W)$) and move down against the direction
of the gradient until a minimum is reached. $\text{Err}(W)$ is defined
on the set of all weights, which we here regard as the vector $W$. So we
try to decrease or to minimize the error by simply tweaking the weights.
Thus one receives information about how to change the weights (the
change in all weights is referred to as $\Delta W$) by calculating the
gradient $\nabla \text{Err}(W)$ of the error function $\text{Err}(W)$:

$\Delta W \sim -\nabla \text{Err}(W)$. (5.1)

² Following the tradition of the literature, I previously defined $W$ as
a weight matrix. I am aware of this conflict but it should not bother us
here.


Figure 5.5: Exemplary error surface of a neural network with two
trainable connections $w_1$ and $w_2$. Generally, neural networks have
more than two connections, but this would have made the illustration too
complex. And most of the time the error surface is too craggy, which
complicates the search for the minimum.

Due to this relation there is a proportionality constant $\eta$ for
which equality holds ($\eta$ will soon get another meaning and a real
practical use beyond the mere meaning of a proportionality constant. I
just ask the reader to be patient for a while.):

$\Delta W = -\eta \nabla \text{Err}(W)$. (5.2)

To simplify further analysis, we now rewrite the gradient of the error
function with respect to all weights as a usual partial derivative with
respect to a single weight $w_{i,\Omega}$ (the only variable weights
exist between the hidden and the output layer $\Omega$). Thus, we tweak
every single weight and observe how the error function changes, i.e. we
differentiate the error function with respect to a weight $w_{i,\Omega}$
and obtain the value $\Delta w_{i,\Omega}$ of how to change this weight.

$\Delta w_{i,\Omega} = -\eta \frac{\partial \text{Err}(W)}{\partial w_{i,\Omega}}$. (5.3)
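The idea of "tweaking every single weight and observing how the error changes" can be tried out literally with a finite-difference approximation of the partial derivative in equation 5.3. The sketch below is not how the delta rule is computed in practice; it only makes the gradient descent step tangible. The error surface `example_err`, the step width `h` and all function names are assumptions chosen for the illustration.

```python
def numerical_gradient(err, weights, h=1e-6):
    """Approximate dErr/dw_i for every weight by a central difference."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += h      # tweak weight i a little upwards ...
        down = list(weights)
        down[i] -= h    # ... and a little downwards
        grad.append((err(up) - err(down)) / (2 * h))
    return grad


def gradient_descent_step(err, weights, eta=0.1):
    """One step of equation 5.3: delta w_i = -eta * dErr/dw_i."""
    grad = numerical_gradient(err, weights)
    return [w - eta * g for w, g in zip(weights, grad)]


def example_err(w):
    """Hypothetical error surface with its minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


w = [0.0, 0.0]
for _ in range(100):
    w = gradient_descent_step(example_err, w)
```

After the loop, `w` lies very close to the minimum (1, -2) of the example surface.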


Now the following question arises: How is our error function defined
exactly? It is not good if many results are far away from the desired
ones; the error function should then provide large values. On the other
hand, it is similarly bad if many results are close to the desired ones
but there exists an extremely far outlying result. The squared distance
between the output vector $y$ and the teaching input $t$ appears
adequate to our needs. It provides the error $\text{Err}_p$ that is
specific for a training sample $p$ over the output of all output neurons
$\Omega$:

$\text{Err}_p(W) = \frac{1}{2} \sum_{\Omega \in O} (t_{p,\Omega} - y_{p,\Omega})^2$. (5.4)

Thus, we calculate the squared difference of the components of the
vectors $t$ and $y$, given the pattern $p$, and sum up these squares.
The summation of the specific errors $\text{Err}_p(W)$ of all patterns
$p$ then yields the definition of the error Err and therefore the
definition of the error function $\text{Err}(W)$:

$\text{Err}(W) = \sum_{p \in P} \text{Err}_p(W)$ (5.5)

$= \frac{1}{2} \sum_{p \in P} \left( \sum_{\Omega \in O} (t_{p,\Omega} - y_{p,\Omega})^2 \right)$. (5.6)

The observant reader will certainly wonder where the factor
$\frac{1}{2}$ in equation 5.4 suddenly came from and why there is no
root in the equation, as this formula looks very similar to the
Euclidean distance. Both facts result from simple pragmatics: Our
intention is to minimize the error. Because the root function is
monotonic, minimizing the squared term also minimizes its root, so we
can simply omit the root for reasons of calculation and implementation
effort. Similarly, it does not matter if the term to be minimized is
divided by 2: Therefore I am allowed to multiply by $\frac{1}{2}$. This
is just done so that it cancels with a 2 in the course of our
calculation.
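Equations 5.4 to 5.6 translate almost word for word into code. The following sketch assumes that a network is simply any callable that maps an input vector to an output vector; `identity_network` and the two training pairs are hypothetical and only serve to produce numbers that can be checked by hand.

```python
def specific_error(t, y):
    """Err_p (equation 5.4): half the summed squared differences
    between teaching input t and output y over all output neurons."""
    return 0.5 * sum((t_k - y_k) ** 2 for t_k, y_k in zip(t, y))


def total_error(network, patterns):
    """Err (equations 5.5 and 5.6): sum of the specific errors over all patterns.

    network:  any callable mapping an input vector to an output vector
    patterns: list of (p, t) pairs
    """
    return sum(specific_error(t, network(p)) for p, t in patterns)


def identity_network(p):
    """Hypothetical 'network' that copies its input to the output."""
    return p


patterns = [([0.0, 1.0], [0.0, 0.0]), ([1.0, 1.0], [1.0, 0.0])]
err = total_error(identity_network, patterns)
```

Each of the two patterns is wrong by 1 in exactly one component, so each contributes a specific error of 0.5 and the total error is 1.0.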


Now we want to continue deriving the delta rule for linear activation
functions. We have already discussed that we tweak the individual
weights $w_{i,\Omega}$ a bit and see how the error $\text{Err}(W)$ is
changing, which corresponds to the derivative of the error function
$\text{Err}(W)$ with respect to the very same weight $w_{i,\Omega}$.
This derivative corresponds to the sum of the derivatives of all
specific errors $\text{Err}_p$ with respect to this weight (since the
total error $\text{Err}(W)$ results from the sum of the specific
errors):

$\Delta w_{i,\Omega} = -\eta \frac{\partial \text{Err}(W)}{\partial w_{i,\Omega}}$ (5.7)

$= \sum_{p \in P} -\eta \frac{\partial \text{Err}_p(W)}{\partial w_{i,\Omega}}$. (5.8)

Once again I want to think about the question of how a neural network
processes data. Basically, the data is only transferred through a
function, the result of the function is sent through another one, and so
on. If we ignore the output function, the path of the neuron outputs
$o_{i_1}$ and $o_{i_2}$, which the neurons $i_1$ and $i_2$ feed into a
neuron $\Omega$, first leads through the propagation function (here the
weighted sum), which yields the network input. This is then sent through
the activation function of the neuron $\Omega$ so that we receive the
output of this neuron, which is at the same time a component of the
output vector $y$:

$\text{net}_\Omega \to f_\text{act}$

$= f_\text{act}(\text{net}_\Omega)$

$= o_\Omega$

$= y_\Omega$.

As we can see, this output results from many nested functions:

$o_\Omega = f_\text{act}(\text{net}_\Omega)$ (5.9)

$= f_\text{act}(o_{i_1} \cdot w_{i_1,\Omega} + o_{i_2} \cdot w_{i_2,\Omega})$. (5.10)

It is clear that we could break down the output into the single input
neurons (this is unnecessary here, since they do not process information
in an SLP). Thus, we want to calculate the derivatives of equation 5.8
and due to the nested functions we can apply the chain rule to factorize
the derivative $\frac{\partial \text{Err}_p(W)}{\partial w_{i,\Omega}}$
in equation 5.8:

$\frac{\partial \text{Err}_p(W)}{\partial w_{i,\Omega}} = \frac{\partial \text{Err}_p(W)}{\partial o_{p,\Omega}} \cdot \frac{\partial o_{p,\Omega}}{\partial w_{i,\Omega}}$. (5.11)

Let us take a look at the first multiplicative factor of the above
equation 5.11, which represents the derivative of the specific error
$\text{Err}_p(W)$ with respect to the output, i.e. the change of the
error $\text{Err}_p$ with an output $o_{p,\Omega}$: The examination of
$\text{Err}_p$ (equation 5.4) shows that this change is, up to its sign,
exactly the difference between teaching input and output
$(t_{p,\Omega} - o_{p,\Omega})$ (remember: Since $\Omega$ is an output
neuron, $o_{p,\Omega} = y_{p,\Omega}$). The closer the output is to the
teaching input, the smaller is the specific error. Thus we can replace
one by the other. This difference is also called $\delta_{p,\Omega}$
(which is the reason for the name delta rule):

$\frac{\partial \text{Err}_p(W)}{\partial w_{i,\Omega}} = -(t_{p,\Omega} - o_{p,\Omega}) \cdot \frac{\partial o_{p,\Omega}}{\partial w_{i,\Omega}}$ (5.12)

$= -\delta_{p,\Omega} \cdot \frac{\partial o_{p,\Omega}}{\partial w_{i,\Omega}}$. (5.13)

The second multiplicative factor of equation 5.11 and of the following
one is the derivative of the output specific to the pattern $p$ of the
neuron $\Omega$ with respect to the weight $w_{i,\Omega}$. So how does
$o_{p,\Omega}$ change when the weight from $i$ to $\Omega$ is changed?
Due to the requirement at the beginning of the derivation, we only have
a linear activation function $f_\text{act}$, therefore we can just as
well look at the change of the network input when $w_{i,\Omega}$ is
changing:

$\frac{\partial \text{Err}_p(W)}{\partial w_{i,\Omega}} = -\delta_{p,\Omega} \cdot \frac{\partial \sum_{i \in I} (o_{p,i} w_{i,\Omega})}{\partial w_{i,\Omega}}$. (5.14)

The resulting derivative
$\frac{\partial \sum_{i \in I} (o_{p,i} w_{i,\Omega})}{\partial w_{i,\Omega}}$
can now be simplified: The function $\sum_{i \in I} (o_{p,i} w_{i,\Omega})$
to be differentiated consists of many summands, and only the summand
$o_{p,i} w_{i,\Omega}$ contains the variable $w_{i,\Omega}$ with respect
to which we differentiate. Thus,
$\frac{\partial \sum_{i \in I} (o_{p,i} w_{i,\Omega})}{\partial w_{i,\Omega}} = o_{p,i}$
and therefore:

$\frac{\partial \text{Err}_p(W)}{\partial w_{i,\Omega}} = -\delta_{p,\Omega} \cdot o_{p,i}$ (5.15)

$= -o_{p,i} \cdot \delta_{p,\Omega}$. (5.16)

We insert this in equation 5.8, which results in our modification rule
for a weight $w_{i,\Omega}$:

$\Delta w_{i,\Omega} = \eta \cdot \sum_{p \in P} o_{p,i} \cdot \delta_{p,\Omega}$. (5.17)
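The modification rule 5.17 can be sketched for a linear SLP with a single output neuron as follows. The code assumes the identity as activation function, no bias neuron, and a small hypothetical training set generated by the linear function t = 2·o_1 - o_2; the function names and the learning rate 0.1 are likewise only chosen for this illustration.

```python
def slp_output(weights, inputs):
    """Linear SLP with one output neuron: identity activation on the weighted sum."""
    return sum(w * o for w, o in zip(weights, inputs))


def delta_rule_epoch(weights, patterns, eta):
    """One weight update according to equation 5.17:
    sum o_{p,i} * delta_{p,Omega} over all patterns, then change the weights once."""
    change = [0.0] * len(weights)
    for inputs, t in patterns:
        delta = t - slp_output(weights, inputs)  # delta_{p,Omega} = t - o
        for i, o in enumerate(inputs):
            change[i] += o * delta
    return [w + eta * c for w, c in zip(weights, change)]


# Hypothetical training set generated by t = 2*o_1 - 1*o_2.
patterns = [((0.0, 1.0), -1.0), ((1.0, 0.0), 2.0), ((1.0, 1.0), 1.0)]
weights = [0.0, 0.0]
for _ in range(200):
    weights = delta_rule_epoch(weights, patterns, eta=0.1)
```

Because the target function is itself linear, the weights approach the generating values (2, -1) and the error tends to zero.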

However: From the very beginning the derivation has been intended as an
offline rule by means of the question of how to add the errors of all
patterns and how to learn them after all patterns have been represented.
Although this approach is mathematically correct, the implementation is
far more time-consuming and, as we will see later in this chapter,
partially needs a lot of computational effort during training.

The "online-learning version" of the delta rule simply omits the
summation, and learning is realized immediately after the presentation
of each pattern; this also simplifies the notation (which is no longer
necessarily related to a pattern $p$):

$\Delta w_{i,\Omega} = \eta \cdot o_i \cdot \delta_\Omega$. (5.18)

This version of the delta rule shall be used for the following
definition:

Definition 5.6 (Delta rule). If we determine, analogously to the
aforementioned derivation, that the function $h$ of the Hebbian theory
(equation 4.6) only provides the output $o_i$ of the predecessor neuron
$i$ and if the function $g$ is the difference between the desired
activation $t_\Omega$ and the actual activation $a_\Omega$, we will
receive the delta rule, also known as Widrow-Hoff rule:

$\Delta w_{i,\Omega} = \eta \cdot o_i \cdot (t_\Omega - a_\Omega) = \eta o_i \delta_\Omega$ (5.19)

If we use the desired output (instead of the activation) as teaching
input, and therefore the output function of the output neurons does not
represent an identity, we obtain

$\Delta w_{i,\Omega} = \eta \cdot o_i \cdot (t_\Omega - o_\Omega) = \eta o_i \delta_\Omega$ (5.20)

and $\delta_\Omega$ then corresponds to the difference between
$t_\Omega$ and $o_\Omega$.

In the case of the delta rule, the change of all weights to an output
neuron $\Omega$ is proportional

> to the difference between the current activation or output $a_\Omega$
or $o_\Omega$ and the corresponding teaching input $t_\Omega$. We want
to refer to this factor as $\delta_\Omega$, which is also referred to as
"Delta".

Apparently the delta rule only applies for SLPs, since the formula is
always related to the teaching input, and there is no teaching input for
the inner processing layers of neurons.


5.2 An SLP is only capable of 
representing linearly 
separable data 

Let $f$ be the XOR function which expects two binary inputs and
generates a binary output (for the precise definition see table 5.1).

In. 1   In. 2   Output
0       0       0
0       1       1
1       0       1
1       1       0

Table 5.1: Definition of the logical XOR. The input values are shown on
the left, the output values on the right.

Let us try to represent the XOR function by means of an SLP with two
input neurons $i_1$, $i_2$ and one output neuron $\Omega$ (fig. 5.6).


Figure 5.6: Sketch of a singlelayer perceptron that shall represent the
XOR function - which is impossible.


Here we use the weighted sum as propagation function, a binary
activation function with the threshold value $\Theta$ and the identity
as output function. Depending on $i_1$ and $i_2$, $\Omega$ has to output
the value 1 if the following holds:

$\text{net}_\Omega = o_{i_1} w_{i_1,\Omega} + o_{i_2} w_{i_2,\Omega} \geq \Theta_\Omega$ (5.21)

We assume a positive weight $w_{i_1,\Omega}$ (we divide by it, and a
negative value would flip the inequality); the inequality 5.21 is then
equivalent to

$o_{i_1} \geq \frac{1}{w_{i_1,\Omega}} (\Theta_\Omega - o_{i_2} w_{i_2,\Omega})$ (5.22)

With a constant threshold value $\Theta_\Omega$, the right part of
inequality 5.22 is a straight line through a coordinate system defined
by the possible outputs $o_{i_1}$ and $o_{i_2}$ of the input neurons
$i_1$ and $i_2$ (fig. 5.7).


Figure 5.7: Linear separation of n = 2 inputs of the input neurons $i_1$
and $i_2$ by a 1-dimensional straight line. A and B show the corners
belonging to the sets of the XOR function that are to be separated.


For a positive $w_{i_1,\Omega}$ (as required for inequality 5.22) the
output neuron $\Omega$ fires for input combinations lying above the
generated straight line. For a negative $w_{i_1,\Omega}$ it would fire
for all input combinations lying below the straight line. Note that only
the four corners of the unit square are possible inputs because the XOR
function only knows binary inputs.


n   number of binary functions   lin. separable ones   share
1   4                            4                     100%
2   16                           14                    87.5%
3   256                          104                   40.6%
4   65,536                       1,772                 2.7%
5   4.3 · 10^9                   94,572                0.002%
6   1.8 · 10^19                  5,028,134             ≈ 0%

Table 5.2: Number of functions concerning n binary inputs, and number
and proportion of the functions thereof which can be linearly separated.
In accordance with [Zel94, Wid89, Was89].


Figure 5.8: Linear separation of n = 3 inputs from input neurons $i_1$,
$i_2$ and $i_3$ by a 2-dimensional plane.


In order to solve the XOR problem, we 
have to turn and move the straight line so 
that input set A = {(0, 0), (1,1)} is sepa¬ 
rated from input set B = {(0,1), (1,0)} - 
this is, obviously, impossible. 
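The impossibility can also be made plausible by brute force. The sketch below tries a small grid of weights and thresholds for a single threshold neuron and collects every Boolean function of two inputs that shows up. The grid (`weight_values`, `theta_values`) is a hypothetical choice made for this example, so the search is an illustration and not a proof.

```python
from itertools import product

CORNERS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # the four binary input combinations


def slp_truth_table(w1, w2, theta):
    """Outputs of a threshold neuron on all four corners of the unit square."""
    return tuple(1 if w1 * a + w2 * b >= theta else 0 for a, b in CORNERS)


# Small, hypothetical search grid for the two weights and the threshold.
weight_values = [-2, -1, 0, 1, 2]
theta_values = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

representable = {
    slp_truth_table(w1, w2, theta)
    for w1, w2, theta in product(weight_values, weight_values, theta_values)
}

XOR = (0, 1, 1, 0)
AND = (0, 0, 0, 1)
OR = (0, 1, 1, 1)
```

On this grid, `representable` contains AND and OR but not XOR. It holds 14 of the 16 possible functions; the two missing ones are XOR and its negation.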


Generally, the input parameters of $n$ input neurons can be represented
in an $n$-dimensional cube which is separated by an SLP through an
$(n-1)$-dimensional hyperplane (fig. 5.8). Only sets that can be
separated by such a hyperplane, i.e. which are linearly separable, can
be classified by an SLP.

Unfortunately, it seems that the percentage of the linearly separable
problems rapidly decreases with increasing $n$ (see table 5.2), which
limits the functionality of the SLP. Additionally, tests for linear
separability are difficult. Thus, for more difficult tasks with more
inputs we need something more powerful than an SLP. The XOR problem
itself is one of these tasks, since a perceptron that is supposed to
represent the XOR function already needs a hidden layer (fig. 5.9).
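That one hidden layer is enough for XOR can be checked with a tiny hand-wired network of threshold neurons. The weights and thresholds below are one possible choice made up for this sketch; they are not necessarily the ones used in the figure.

```python
def step(net, theta):
    """Binary threshold activation."""
    return 1 if net >= theta else 0


def xor_network(a, b):
    """Two threshold neurons in a hidden layer, one threshold output neuron."""
    h_or = step(a + b, 0.5)    # fires if at least one input is 1
    h_and = step(a + b, 1.5)   # fires only if both inputs are 1
    # "OR but not AND" is exactly XOR: weight +1 from h_or, weight -1 from h_and.
    return step(h_or - h_and, 0.5)


truth_table = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each hidden neuron contributes one straight line, and the output neuron combines the two half-planes, which is what a single trainable weight layer cannot do.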


few tasks 
are linearly 
separable 
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more planes 



Figure 5.9: Neural network realizing the XOR function. Threshold values
(where they exist) are located within the neurons.


5.3 A multilayer perceptron 
contains more trainable 
weight layers 


A perceptron with two or more trainable weight layers (called multilayer
perceptron or MLP) is more powerful than an SLP. As we know, a
singlelayer perceptron can divide the input space by means of a
hyperplane (in a two-dimensional input space by means of a straight
line). A two-stage perceptron (two trainable weight layers, three neuron
layers) can classify convex polygons by further processing these
straight lines, e.g. in the form "recognize patterns lying above
straight line 1, below straight line 2 and below straight line 3". Thus,
metaphorically speaking, we took an SLP with several output neurons and
"attached" another SLP (upper part of fig. 5.10). A multilayer
perceptron represents a universal function approximator, which is proven
by the Theorem of Cybenko [Cyb89].

Another trainable weight layer proceeds analogously, now with the convex
polygons. Those can be added, subtracted or somehow processed with other
operations (lower part of fig. 5.10).

Generally, it can be mathematically 
proven that even a multilayer perceptron 
with one layer of hidden neurons can ar¬ 
bitrarily precisely approximate functions 
with only finitely many discontinuities as 
well as their first derivatives. Unfortu¬ 
nately, this proof is not constructive and 
therefore it is left to us to find the correct 
number of neurons and weights. 




In the following we want to use a widespread abbreviated form for
different multilayer perceptrons: We denote a two-stage perceptron with
5 neurons in the input layer, 3 neurons in the hidden layer and 4
neurons in the output layer as a 5-3-4-MLP.


Definition 5.7 (Multilayer perceptron). 
Perceptrons with more than one layer of 
variably weighted connections are referred 
to as multilayer perceptrons (MLP). 
An n-layer or n-stage perceptron has 
thereby exactly n variable weight layers 
and n + 1 neuron layers (the retina is dis¬ 
regarded here) with neuron layer 1 being 
the input layer. 


Since three-stage perceptrons can classify sets of any form by combining
and separating arbitrarily many convex polygons, another step will not
be advantageous with respect to function representations. Be cautious
when reading the literature: There are many different definitions of
what is counted as a layer. Some sources count the neuron layers, some
count the weight layers. Some sources include the retina, some the
trainable weight layers. Some exclude (for some reason) the output
neuron layer. In this work, I chose the definition that provides, in my
opinion, the most information about the learning capabilities, and I
will use it consistently. Remember: An $n$-stage perceptron has exactly
$n$ trainable weight layers. You can find a summary of which perceptrons
can classify which types of sets in table 5.3.


n   classifiable sets
1   hyperplane
2   convex polygon
3   any set
4   any set as well, i.e. no advantage

Table 5.3: Representation of which perceptron can classify which types
of sets with n being the number of trainable weight layers.


Figure 5.10: We know that an SLP represents a straight line. With 2
trainable weight layers, several straight lines can be combined to form
convex polygons (above). By using 3 trainable weight layers several
polygons can be formed into arbitrary sets (below).


We now want to face the challenge of training perceptrons with more than
one weight layer.


5.4 Backpropagation of error 
generalizes the delta rule 
to allow for MLP training 


Next, I want to derive and explain the backpropagation of error learning
rule (abbreviated: backpropagation, backprop or BP), which can be used
to train multi-stage perceptrons with semi-linear³ activation functions.
Binary threshold functions and other non-differentiable functions are no
longer supported, but that doesn't matter: We have seen that the Fermi
function or the hyperbolic tangent can arbitrarily approximate the
binary threshold function by means of a temperature parameter $T$. To a
large extent I will follow the derivation according to [Zel94] and
[MR86]. Once again I want to point out that this procedure had
previously been published by Paul Werbos in [Wer74] but had considerably
fewer readers than in [MR86].
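How the temperature parameter pushes the Fermi function towards the binary threshold function can be observed numerically. The sketch assumes the usual form 1 / (1 + e^(-x/T)) of the Fermi function with temperature; the sample points and temperature values are arbitrary choices for the illustration.

```python
import math


def fermi(x, temperature=1.0):
    """Fermi (logistic) function with temperature parameter T (assumed form)."""
    return 1.0 / (1.0 + math.exp(-x / temperature))


def binary_threshold(x):
    """Binary threshold function with threshold 0."""
    return 1.0 if x >= 0 else 0.0


# The smaller T becomes, the closer the Fermi function gets to the step.
for T in (1.0, 0.1, 0.01):
    gap = max(abs(fermi(x, T) - binary_threshold(x))
              for x in (-1.0, -0.5, 0.5, 1.0))
    print(f"T = {T}: largest deviation from the threshold function = {gap:.6f}")
```

Unlike the threshold function, the Fermi function stays differentiable for every T > 0, which is the property backpropagation relies on.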


Backpropagation is a gradient descent procedure (including all strengths
and weaknesses of the gradient descent) with the error function
$\text{Err}(W)$ receiving all $n$ weights as arguments (fig. 5.5) and
assigning them to the output error, i.e. being $n$-dimensional. On
$\text{Err}(W)$ a point of small error or even a point of the smallest
error is sought by means of the gradient descent. Thus, in analogy to
the delta rule, backpropagation trains the weights of the neural
network. And it is exactly the delta rule or its variable $\delta_i$ for
a neuron $i$ which is expanded from one trainable weight layer to
several ones by backpropagation.

³ Semilinear functions are monotonous and differentiable - but generally
they are not linear.


5.4.1 The derivation is similar to 

the one of the delta rule, but 
with a generalized delta 


Let us define in advance that the network input of the individual
neurons $i$ results from the weighted sum. Furthermore, as with the
derivation of the delta rule, let $o_{p,i}$, $\text{net}_{p,i}$ etc. be
defined as the already familiar $o_i$, $\text{net}_i$, etc. under the
input pattern $p$ we used for the training. Let the output function be
the identity again, thus $o_i = f_\text{act}(\text{net}_{p,i})$ holds
for any neuron $i$. Since this is a generalization of the delta rule, we
use the same formula framework as with the delta rule (equation 5.20).
As already indicated, we have to generalize the variable $\delta$ for
every neuron.

First of all: Where is the neuron for which we want to calculate
$\delta$? It is obvious to select an arbitrary inner neuron $h$ having a
set $K$ of predecessor neurons $k$ as well as a set $L$ of successor
neurons $l$, which are also inner neurons (see fig. 5.11). It is
therefore irrelevant whether the predecessor neurons are already the
input neurons.

Figure 5.11: Illustration of the position of our neuron $h$ within the
neural network. It is lying in layer $H$, the preceding layer is $K$,
the subsequent layer is $L$.


Now we perform the same derivation as for the delta rule and split
functions by means of the chain rule. I will not discuss this derivation
in great detail, but the principle is similar to that of the delta rule
(the differences are, as already mentioned, in the generalized
$\delta$). We initially differentiate the error function Err with
respect to a weight $w_{k,h}$.

$\frac{\partial \text{Err}(w_{k,h})}{\partial w_{k,h}} = \underbrace{\frac{\partial \text{Err}}{\partial \text{net}_h}}_{= -\delta_h} \cdot \frac{\partial \text{net}_h}{\partial w_{k,h}}$ (5.23)

The first factor of equation 5.23 is $-\delta_h$, which we will deal
with later in this text. The numerator of the second factor of the
equation includes the network input, i.e. the weighted sum is included
in the numerator so that we can immediately differentiate it. Again, all
summands of the sum drop out apart from the summand containing
$w_{k,h}$. This summand is referred to as $w_{k,h} \cdot o_k$. If we
calculate the derivative, what remains is the output of neuron $k$:


$\frac{\partial \text{net}_h}{\partial w_{k,h}} = \frac{\partial \sum_{k \in K} w_{k,h} o_k}{\partial w_{k,h}}$ (5.24)

$= \boxed{o_k}$ (5.25)

As promised, we will now discuss the $-\delta_h$ of equation 5.23, which
is split up again according to the chain rule:

$\delta_h = -\frac{\partial \text{Err}}{\partial \text{net}_h}$ (5.26)

$= -\frac{\partial \text{Err}}{\partial o_h} \cdot \frac{\partial o_h}{\partial \text{net}_h}$ (5.27)

The derivative of the output with respect to the network input (the
second factor in equation 5.27) clearly equals the derivative of the
activation function with respect to the network input:

$\frac{\partial o_h}{\partial \text{net}_h} = \frac{\partial f_\text{act}(\text{net}_h)}{\partial \text{net}_h}$ (5.28)

$= \boxed{f'_\text{act}(\text{net}_h)}$ (5.29)

Consider this an important passage! We now analogously differentiate the
first factor in equation 5.27. Therefore, we have to point out that the
derivative of the error function with respect to the output of an inner
neuron layer depends on the vector of all network inputs of the next
following layer. This is reflected in equation 5.30:

$-\frac{\partial \text{Err}}{\partial o_h} = -\frac{\partial \text{Err}(\text{net}_{l_1}, \ldots, \text{net}_{l_{|L|}})}{\partial o_h}$ (5.30)

According to the definition of the multi-dimensional chain rule, we
immediately obtain equation 5.31:

$-\frac{\partial \text{Err}}{\partial o_h} = \sum_{l \in L} \left( -\frac{\partial \text{Err}}{\partial \text{net}_l} \cdot \frac{\partial \text{net}_l}{\partial o_h} \right)$ (5.31)

The sum in equation 5.31 contains two factors. Now we want to discuss
these factors being added over the subsequent layer $L$. We simply
calculate the second factor in the following equation 5.33:

$\frac{\partial \text{net}_l}{\partial o_h} = \frac{\partial \sum_{h \in H} w_{h,l} \cdot o_h}{\partial o_h}$ (5.32)

$= \boxed{w_{h,l}}$ (5.33)

The same applies for the first factor according to the definition of our
$\delta$:

$-\frac{\partial \text{Err}}{\partial \text{net}_l} = \boxed{\delta_l}$ (5.34)

Now we insert:

$\Rightarrow -\frac{\partial \text{Err}}{\partial o_h} = \sum_{l \in L} \delta_l w_{h,l}$ (5.35)

You can find a graphic version of the $\delta$ generalization including
all splittings in fig. 5.12.


Figure 5.12: Graphical representation of the equations (by equal signs)
and chain rule splittings (by arrows) in the framework of the
backpropagation derivation. The leaves of the tree reflect the final
results from the generalization of $\delta$, which are framed in the
derivation.


The reader might already have noticed that some intermediate results
were shown in frames. Exactly those intermediate results were
highlighted in that way which are a factor in the change in weight of
$w_{k,h}$. If the aforementioned equations are combined with the
highlighted intermediate results, the outcome of this will be the wanted
change in weight $\Delta w_{k,h}$ to

$\Delta w_{k,h} = \eta o_k \delta_h$ with (5.36)

$\delta_h = f'_\text{act}(\text{net}_h) \cdot \sum_{l \in L} (\delta_l w_{h,l})$

- of course only in case of $h$ being an inner neuron (otherwise there
would not be a subsequent layer $L$).

The case of $h$ being an output neuron has already been discussed during
the derivation of the delta rule. All in all, the result is the
generalization of the delta rule, called backpropagation of error:

$\Delta w_{k,h} = \eta o_k \delta_h$ with

$\delta_h = \begin{cases} f'_\text{act}(\text{net}_h) \cdot (t_h - y_h) & (h \text{ outside}) \\ f'_\text{act}(\text{net}_h) \cdot \sum_{l \in L} (\delta_l w_{h,l}) & (h \text{ inside}) \end{cases}$ (5.37)

In contrast to the delta rule, $\delta$ is treated differently depending
on whether $h$ is an output or an inner (i.e. hidden) neuron:

1. If $h$ is an output neuron, then

$\delta_{p,h} = f'_\text{act}(\text{net}_{p,h}) \cdot (t_{p,h} - y_{p,h})$ (5.38)

Thus, under our training pattern $p$ the weight $w_{k,h}$ from $k$ to
$h$ is changed proportionally according to

> the learning rate $\eta$,

> the output $o_{p,k}$ of the predecessor neuron $k$,

> the gradient of the activation function at the position of the network
input of the successor neuron $f'_\text{act}(\text{net}_{p,h})$ and

> the difference between teaching input $t_{p,h}$ and output $y_{p,h}$
of the successor neuron $h$.

In this case, backpropagation is working on two neuron layers, the
output layer with the successor neuron $h$ and the preceding layer with
the predecessor neuron $k$.

2. If $h$ is an inner, hidden neuron, then

$\delta_{p,h} = f'_\text{act}(\text{net}_{p,h}) \cdot \sum_{l \in L} (\delta_{p,l} \cdot w_{h,l})$ (5.39)

holds. I want to explicitly mention that backpropagation is now working
on three layers. Here, neuron $k$ is the predecessor of the connection
to be changed with the weight $w_{k,h}$, the neuron $h$ is the successor
of the connection to be changed and the neurons $l$ are lying in the
layer following the successor neuron. Thus, according to our training
pattern $p$, the weight $w_{k,h}$ from $k$ to $h$ is proportionally
changed according to

> the learning rate $\eta$,

> the output of the predecessor neuron $o_{p,k}$,

> the gradient of the activation function at the position of the network
input of the successor neuron $f'_\text{act}(\text{net}_{p,h})$,

> as well as, and this is the difference, according to the weighted sum
of the changes in weight to all neurons following $h$,
$\sum_{l \in L} (\delta_{p,l} \cdot w_{h,l})$.
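The two cases 5.38 and 5.39 can be sketched for a small network in plain Python. The code makes several assumptions that go beyond the text: the Fermi function serves as activation function for every non-input neuron, there are no bias neurons, the weights of one layer are stored as nested lists `layer[k][h]`, and the 2-2-1 network with its weights and training pattern is purely hypothetical.

```python
import math


def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))


def fermi_prime(x):
    f = fermi(x)
    return f * (1.0 - f)


def forward(weights, inputs):
    """Return the network inputs and outputs of every layer (input layer first)."""
    outputs, nets = [list(inputs)], [None]
    for layer in weights:  # layer[k][h] is the weight w_{k,h}
        prev = outputs[-1]
        net = [sum(prev[k] * layer[k][h] for k in range(len(prev)))
               for h in range(len(layer[0]))]
        nets.append(net)
        outputs.append([fermi(n) for n in net])
    return nets, outputs


def backprop_changes(weights, inputs, targets, eta):
    """Weight changes eta * o_k * delta_h for one pattern (online learning)."""
    nets, outputs = forward(weights, inputs)
    # Output layer, equation 5.38.
    deltas = [fermi_prime(n) * (t - y)
              for n, t, y in zip(nets[-1], targets, outputs[-1])]
    changes = [None] * len(weights)
    for idx in range(len(weights) - 1, -1, -1):
        layer, o_prev = weights[idx], outputs[idx]
        changes[idx] = [[eta * o_k * d_h for d_h in deltas] for o_k in o_prev]
        if idx > 0:
            # Inner layer, equation 5.39: weighted sum of the successors' deltas.
            deltas = [fermi_prime(nets[idx][h]) *
                      sum(deltas[l] * layer[h][l] for l in range(len(deltas)))
                      for h in range(len(layer))]
    return changes


def pattern_error(weights, inputs, targets):
    """Specific error Err_p of equation 5.4 for one pattern."""
    _, outputs = forward(weights, inputs)
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs[-1]))


# Hypothetical 2-2-1 network and training pattern.
weights = [[[0.5, -0.3], [0.8, 0.2]], [[1.0], [-1.5]]]
x, t, eta = [1.0, 0.0], [1.0], 0.5
changes = backprop_changes(weights, x, t, eta)
```

A useful sanity check for any such implementation is to compare every entry of `changes` with -eta times a finite-difference estimate of the derivative of `pattern_error` with respect to the same weight; the two should agree up to rounding.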
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5.4 Backpropagation of error 


Definition 5.8 (Backpropagation). If we 5.4.2 Heading back: Boiling 

summarize formulas |5. 38 on the preceding! backpropagation down to 


ceive the following final formula for back- 

propagation (the identifiers p are onr- As explained above, the delta rule is a 
nrited for reasons of clarity): special case of backpropagation for one- 

stage perceptrons and linear activation 
Awk h = V 0 k$h with functions - I want to briefly explain this 

/' ct (net/,.) • (t h - y h ) (h outside) circumstance and develop the delta rule 
f^Jpeth) ■ Eiedhwh,i) (h inside) out of backpropagation in order to aug- 

(5.40) 

SIMIPE: An online variant of backpro¬ 
pagation is implemented in the method 
trainBackpropagationOfError within the 
class NeuralNetwork. 


ment the understanding of both rules. We 
have seen that backpropagation is defined 

by 


A Wk,h = V°k5h with 

/' ct (netft) • (t h - y h ) (h outside) 

/act( n etfe) • £ ieL( S l w h,l) (h inside) 


S h = 



page and |5.39 on the facing page 


we re¬ 


delta rule 


backprop 
expands 
delta rule 


(5.41) 


It is obvious that backpropagation ini¬ 
tially processes the last weight layer di¬ 
rectly by means of the teaching input and 
then works backwards from layer to layer 
while considering each preceding change in 
weights. Thus, the teaching input leaves 
traces in all weight layers. Here I describe 
the first (delta rule) and the second part 
of backpropagation (generalized delta rule 
on more layers) in one go, which may meet 
the requirements of the matter but not 
of the research. The first part is obvious, 
which you will soon see in the framework 
of a mathematical gimmick. Decades of 
development time and work lie between the 
first and the second, recursive part. Like 
many groundbreaking inventions, it was 
not until its development that it was recog¬ 
nized how plausible this invention was. 


Since we only use it for one-stage percep¬ 
trons, the second part of backpropagation 
(light-colored) is omitted without substitu¬ 
tion. The result is: 

A w k ,h = VOkdh with 

= fLct( net h) • (th ~ o h ) 

Furthermore, we only want to use linear activation functions so that $f'_{\mathrm{act}}$ (light-colored) is constant. As is generally known, constants can be combined, and therefore we directly merge the constant derivative $f'_{\mathrm{act}}$ and (being constant for at least one learning cycle) the learning rate $\eta$ (also light-colored) in $\eta$. Thus, the result is:

$$\Delta w_{k,h} = \eta\, o_k\, \delta_h = \eta\, o_k \cdot (t_h - o_h) \tag{5.43}$$

This exactly corresponds to the delta rule 
definition. 
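The reduction can also be checked numerically. The following is a minimal sketch, assuming a single output neuron with the identity as activation function (so the derivative is the constant 1); the function names and the sample values are illustrative assumptions and are not taken from Snipe.

```python
def backprop_output_change(o_k, net_h, t_h, eta, f_act, f_act_prime):
    """Weight change for an output neuron h, following equation (5.42)."""
    o_h = f_act(net_h)
    delta_h = f_act_prime(net_h) * (t_h - o_h)
    return eta * o_k * delta_h


def delta_rule_change(o_k, o_h, t_h, eta):
    """Weight change according to the delta rule, equation (5.43)."""
    return eta * o_k * (t_h - o_h)


def identity(x):
    return x


def constant_one(x):
    # Derivative of the identity activation function.
    return 1.0


# Illustrative values: one predecessor output, one weight, one teaching input.
o_k, w_kh, t_h, eta = 0.5, 0.8, 1.0, 0.1
net_h = o_k * w_kh

bp_change = backprop_output_change(o_k, net_h, t_h, eta, identity, constant_one)
dr_change = delta_rule_change(o_k, identity(net_h), t_h, eta)
```

With a linear activation function both functions return the same value, which is the statement of this section in executable form.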
5.4.3 The selection of the learning rate has heavy influence on the learning process

how fast will be learned?

In the meantime we have often seen that the change in weight is, in any case, proportional to the learning rate $\eta$. Thus, the selection of $\eta$ is crucial for the behaviour of backpropagation and for learning procedures in general.

Definition 5.9 (Learning rate). Speed and accuracy of a learning procedure can always be controlled by and are always proportional to a learning rate which is written as $\eta$.

If the value of the chosen $\eta$ is too large, the jumps on the error surface are also too large and, for example, narrow valleys could simply be jumped over. Additionally, the movements across the error surface would be very uncontrolled. Thus, a small $\eta$ is the desired input, which, however, can cost a huge, often unacceptable amount of time. Experience shows that good learning rate values are in the range of

$$0.01 \le \eta \le 0.9.$$

The selection of $\eta$ significantly depends on the problem, the network and the training data, so that it is barely possible to give practical advice. But for instance it is popular to start with a relatively large $\eta$, e.g. 0.9, and to slowly decrease it down to 0.1. For simpler problems $\eta$ can often be kept constant.

5.4.3.1 Variation of the learning rate over time

During training, another stylistic device can be a variable learning rate: In the beginning, a large learning rate leads to good results, but later it results in inaccurate learning. A smaller learning rate is more time-consuming, but the result is more precise. Thus, during the learning process the learning rate needs to be decreased by one order of magnitude once or repeatedly.

A common error (which also seems to be a very neat solution at first glance) is to continually decrease the learning rate. Here it quickly happens that the descent of the learning rate is larger than the ascent of a hill of the error function we are climbing. The result is that we simply get stuck at this ascent. Solution: Rather reduce the learning rate in discrete steps, as mentioned above.

5.4.3.2 Different layers - Different learning rates

The farther we move away from the output layer during the learning process, the slower backpropagation is learning. Thus, it is a good idea to select a larger learning rate for the weight layers close to the input layer than for the weight layers close to the output layer.
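Both recommendations of this section can be sketched in a few lines. This is only an illustration under assumptions: the function names, the interval `drop_every` and the factor `growth` are hypothetical choices; only the reduction by one order of magnitude and the idea of larger rates near the input layer come from the text.

```python
def step_decay(eta_start, epoch, drop_every, factor=0.1):
    """Reduce the learning rate by one order of magnitude every `drop_every` epochs.

    The rate stays constant between the drops, so it is not decreased continually.
    """
    return eta_start * factor ** (epoch // drop_every)


def layer_learning_rates(eta_output, num_weight_layers, growth=2.0):
    """Return one learning rate per weight layer.

    Index 0 is the weight layer next to the input layer and receives the
    largest rate; the last entry belongs to the weight layer next to the
    output layer and equals `eta_output`. The factor `growth` is an assumption.
    """
    last = num_weight_layers - 1
    return [eta_output * growth ** (last - i) for i in range(num_weight_layers)]


rates_over_time = [step_decay(0.9, epoch, drop_every=100) for epoch in (0, 99, 100, 250)]
rates_per_layer = layer_learning_rates(0.1, num_weight_layers=3)
```

Keeping the rate piecewise constant avoids the situation described above, in which the rate shrinks faster than the error function can be descended.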


One learning- 
rate per 
weight 

automatic 
learning rate 
adjustment 


5.5 Resilient backpropagation 
is an extension to 
backpropagation of error 


We have just raised two backpropagation- 
specific properties that can occasionally be 
a problem (in addition to those which are 
already caused by gradient descent itself): 
On the one hand, users of backpropagation can choose a bad learning rate. On the other hand, the further the weights are from the output layer, the slower backpropagation learns. For this reason, Martin Riedmiller et al. enhanced backpropagation and called their version resilient backpropagation (short Rprop) [RB93, Rie94]. I want to compare backpropagation and Rprop, without explicitly declaring one version superior to the other. Before actually dealing with formulas, let us informally compare the two primary ideas behind Rprop (and their consequences) to the already familiar backpropagation.


Learning rates: Backpropagation uses by default a learning rate $\eta$, which is selected by the user, and applies to the entire network. It remains static until it is manually changed. We have already explored the disadvantages of this approach. Here, Rprop pursues a completely different approach: there is no global learning rate. First, each weight $w_{i,j}$ has its own learning rate $\eta_{i,j}$, and second, these learning rates are not chosen by the user, but are automatically set by Rprop itself. Third, the weight changes are not static but are adapted for each time step of Rprop. To account for the temporal change, we have to correctly call it $\eta_{i,j}(t)$. This not only enables more focused learning, it also solves the problem of increasingly slowed-down learning throughout the layers in an elegant way.


Weight change: When using backpropagation, weights are changed proportionally to the gradient of the error function. At first glance, this is really intuitive. However, we incorporate every jagged feature of the error surface into the weight changes. It is at least questionable whether this is always useful. Here, Rprop takes other ways as well: the amount of weight change $\Delta w_{i,j}$ simply directly corresponds to the automatically adjusted learning rate $\eta_{i,j}$. Thus the change in weight is not proportional to the gradient, it is only influenced by the sign of the gradient. Until now we still do not know how exactly the $\eta_{i,j}$ are adapted at run time, but let me anticipate that the resulting process looks considerably less rugged than an error function.


In contrast to backprop the weight update 
step is replaced and an additional step 
for the adjustment of the learning rate is 
added. Now how exactly are these ideas 
being implemented? 


Much 

smoother learning 
5.5.1 Weight changes are not proportional to the gradient

gradient determines only direction of the updates

Let us first consider the change in weight. We have already noticed that the weight-specific learning rates directly serve as absolute values for the changes of the respective weights. There remains the question of where the sign comes from - this is a point at which the gradient comes into play. As with the derivation of backpropagation, we differentiate the error function $\mathrm{Err}(W)$ with respect to the individual weights $w_{i,j}$ and obtain gradients $\frac{\partial \mathrm{Err}(W)}{\partial w_{i,j}}$. Now, the big difference: rather than multiplicatively incorporating the absolute value of the gradient into the weight change, we consider only the sign of the gradient. The gradient hence no longer determines the strength, but only the direction of the weight change.

If the sign of the gradient $\frac{\partial \mathrm{Err}(W)}{\partial w_{i,j}}$ is positive, we must decrease the weight $w_{i,j}$. So the weight is reduced by $\eta_{i,j}$. If the sign of the gradient is negative, the weight needs to be increased. So $\eta_{i,j}$ is added to it. If the gradient is exactly 0, nothing happens at all. Let us now create a formula from this colloquial description. The corresponding terms are affixed with a $(t)$ to show that everything happens at the same time step. This might decrease clarity at first glance, but is nevertheless important because we will soon look at another formula that operates on different time steps. Instead, we shorten the gradient to: $g = \frac{\partial \mathrm{Err}(W)}{\partial w_{i,j}}$.

Definition 5.10 (Weight change in Rprop).

$$\Delta w_{i,j}(t) = \begin{cases}
-\eta_{i,j}(t), & \text{if } g(t) > 0 \\
+\eta_{i,j}(t), & \text{if } g(t) < 0 \\
0 & \text{otherwise.}
\end{cases} \tag{5.44}$$

We now know how the weights are changed 
- now remains the question how the learn¬ 
ing rates are adjusted. Finally, once we 
have understood the overall system, we 
will deal with the remaining details like ini¬ 
tialization and some specific constants. 
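The weight change of Definition 5.10 can be written down directly. The following is a small sketch for a single weight; the function names are illustrative assumptions.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def rprop_weight_change(eta_ij, g):
    """Weight change of Rprop for one weight, following equation (5.44).

    Only the sign of the gradient g is used; the step size is the
    weight-specific learning rate eta_ij.
    """
    return -sign(g) * eta_ij


change_for_positive_gradient = rprop_weight_change(0.1, 3.7)
change_for_negative_gradient = rprop_weight_change(0.1, -0.002)
change_for_zero_gradient = rprop_weight_change(0.1, 0.0)
```

A large gradient (3.7) and a tiny one (0.002) lead to steps of the same size, which is exactly the difference to backpropagation.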


5.5.2 Many dynamically adjusted 
learning rates instead of one 
static 


To adjust the learning rate $\eta_{i,j}$, we again have to consider the associated gradients $g$ of two time steps: the gradient that has just passed $(t-1)$ and the current one $(t)$. Again, only the sign of the gradient matters, and we now must ask ourselves: What can happen to the sign over two time steps? It can stay the same, and it can flip.

If the sign changes from $g(t-1)$ to $g(t)$, we have skipped a local minimum of the error function. Hence, the last update was too large and $\eta_{i,j}(t)$ has to be reduced as compared to the previous $\eta_{i,j}(t-1)$. One can say that the search needs to be more accurate. In mathematical terms, we obtain a new $\eta_{i,j}(t)$ by multiplying the old $\eta_{i,j}(t-1)$ with a constant $\eta^{\downarrow}$, which lies between 0 and 1. In this case we know that in the last time step $(t-1)$ something went wrong -




Rprop only 
learns 
offline 


hence we additionally reset the weight update for the weight $w_{i,j}$ at time step $(t)$ to 0, so that it is not applied at all (not shown in the following formula).

However, if the sign remains the same, one can perform a (careful!) increase of $\eta_{i,j}$ to get past shallow areas of the error function. Here we obtain our new $\eta_{i,j}(t)$ by multiplying the old $\eta_{i,j}(t-1)$ with a constant $\eta^{\uparrow}$ which is greater than 1.

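Definition 5.11 (Adaptation of learning rates in Rprop).

$$\eta_{i,j}(t) = \begin{cases}
\eta^{\uparrow}\, \eta_{i,j}(t-1), & g(t-1)\, g(t) > 0 \\
\eta^{\downarrow}\, \eta_{i,j}(t-1), & g(t-1)\, g(t) < 0 \\
\eta_{i,j}(t-1) & \text{otherwise.}
\end{cases} \tag{5.45}$$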
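Definition 5.11 translates into a short function. This is a sketch for a single learning rate; the function name is an assumption, and the default values 1.2 and 0.5 are the ones motivated in section 5.5.3. The bounds on the learning rate are deliberately left out here.

```python
def adapt_learning_rate(eta_prev, g_prev, g, eta_up=1.2, eta_down=0.5):
    """Adapt one weight-specific learning rate, following equation (5.45).

    eta_prev : learning rate of the previous time step (t - 1)
    g_prev   : gradient of the previous time step (t - 1)
    g        : gradient of the current time step (t)
    """
    product = g_prev * g
    if product > 0:
        # Same sign twice: carefully increase the learning rate.
        return eta_up * eta_prev
    if product < 0:
        # Sign flipped: a minimum was skipped, so reduce the learning rate.
        return eta_down * eta_prev
    return eta_prev


rate_same_sign = adapt_learning_rate(0.1, 1.0, 2.0)
rate_sign_flip = adapt_learning_rate(0.1, 1.0, -2.0)
rate_zero_gradient = adapt_learning_rate(0.1, 0.0, 2.0)
```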


Caution: This also implies that Rprop is
exclusively designed for offline learning. If the
gradients do not have a certain continuity, the
learning process slows down to the lowest 
rates (and remains there). When learning 
online, one changes - loosely speaking - 
the error function with each new epoch, 
since it is based on only one training pat¬ 
tern. This may be often well applicable 
in backpropagation and it is very often 
even faster than the offline version, which 
is why it is used there frequently. It lacks, 
however, a clear mathematical motivation, 
and that is exactly what we need here. 


5.5.3 We are still missing a few 
details to use Rprop in 
practice 

A few minor issues remain unanswered, 
namely 

1. How large are $\eta^{\uparrow}$ and $\eta^{\downarrow}$ (i.e. how much are learning rates reinforced or weakened)?

2. How to choose $\eta_{i,j}(0)$ (i.e. how are the weight-specific learning rates initialized)? 4

3. Which lower and upper bounds $\eta_{\min}$ and $\eta_{\max}$ are set for $\eta_{i,j}$?

We now answer these questions with a 
quick motivation. The initial value for the 
learning rates should be somewhere in the 
order of the initialization of the weights. 
$\eta_{i,j}(0) = 0.1$ has proven to be a good
choice. The authors of the Rprop paper 
explain in an obvious way that this value 
- as long as it is positive and without an ex¬ 
orbitantly high absolute value - does not 
need to be dealt with very critically, as 
it will be quickly overridden by the auto¬ 
matic adaptation anyway. 

Equally uncritical is $\eta_{\max}$, for which they
recommend, without further mathemati¬ 
cal justification, a value of 50 which is used 
throughout most of the literature. One 
can set this parameter to lower values in 
order to allow only very cautious updates. 
Small update steps should be allowed in 
any case, so we set $\eta_{\min} = 10^{-6}$.

4 Pro tip: since the $\eta_{i,j}$ can be changed only by multiplication, 0 would be a rather suboptimal initialization :-)


Now we have left only the parameters $\eta^{\uparrow}$ and $\eta^{\downarrow}$. Let us start with $\eta^{\downarrow}$: If this value is used, we have skipped a minimum whose exact position on the skipped track we do not know. Analogous to the procedure of binary search, where the target object is often skipped as well, we assume it was in the middle of the skipped track. So we need to halve the learning rate, which is why the canonical choice $\eta^{\downarrow} = 0.5$ is being selected. If the value of $\eta^{\uparrow}$ is used, learning rates shall be increased with caution. Here we cannot generalize the principle of binary search and simply use the value 2.0, otherwise the learning rate update will end up consisting almost exclusively of changes in direction. Independent of the particular problems, a value of $\eta^{\uparrow} = 1.2$ has proven to be promising. Slight changes of this value have not significantly affected the rate of convergence. This fact allowed for setting this value as a constant as well.


SNIPE: In Snipe resilient backpropagation
is supported via the method
trainResilientBackpropagation of the 
class NeuralNetwork. Furthermore, you 
can also use an additional improvement 
to resilient propagation, which is, however, 
not dealt with in this work. There are get¬ 
ters and setters for the different parameters 
of Rprop. 
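To see how the pieces of section 5.5 fit together, here is a minimal sketch of offline Rprop on a toy error function. It is not the Snipe implementation: the function names and the toy gradient are assumptions, and storing a gradient of 0 after a sign flip is one possible way to realize the suppressed weight update described above. The parameter defaults are the values named in the text.

```python
def rprop_minimize(gradient, weights, steps=300, eta_start=0.1,
                   eta_up=1.2, eta_down=0.5, eta_min=1e-6, eta_max=50.0):
    """Minimize an error function with Rprop, given its (offline) gradient."""
    w = list(weights)
    eta = [eta_start] * len(w)   # one learning rate per weight
    g_prev = [0.0] * len(w)      # gradients of the previous time step
    for _ in range(steps):
        g = list(gradient(w))
        for i in range(len(w)):
            product = g_prev[i] * g[i]
            if product > 0:
                eta[i] = min(eta[i] * eta_up, eta_max)
            elif product < 0:
                eta[i] = max(eta[i] * eta_down, eta_min)
                g[i] = 0.0       # sign flip: skip the weight update this step
            # Weight change: only the sign of the gradient matters (5.44).
            if g[i] > 0:
                w[i] -= eta[i]
            elif g[i] < 0:
                w[i] += eta[i]
        g_prev = g
    return w


def toy_gradient(w):
    """Gradient of the toy error 0.5 * ((w0 - 3)^2 + (w1 + 1)^2)."""
    return [w[0] - 3.0, w[1] + 1.0]


result = rprop_minimize(toy_gradient, [0.0, 0.0])
```

On this toy problem the learning rates first grow by the factor 1.2 per step and are then halved each time the minimum is skipped, so the weights settle close to (3, -1).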

5.6 Backpropagation has 

often been extended and 
altered besides Rprop 

Backpropagation has often been extended. 
Many of these extensions can simply be im¬ 
plemented as optional features of backpro¬ 
pagation in order to have a larger scope for 
testing. In the following I want to briefly 
describe some of them. 


Rprop is very 
good for 
deep networks 


With advancing computational capabili¬ 
ties of computers one can observe a more 
and more widespread distribution of net¬ 
works that consist of a big number of lay¬ 
ers, i.e. deep networks. For such net¬ 
works it is crucial to prefer Rprop over the 
original backpropagation, because back- 
prop, as already indicated, learns very 
slowly at weights which are far from the
output layer. For problems with a smaller 
number of layers, I would recommend test¬ 
ing the more widespread backpropagation 
(with both offline and online learning) and 
the less common Rprop equivalently. 


5.6.1 Adding momentum to 
learning 


Let us assume we descend a steep slope on skis - what prevents us from immediately stopping at the edge of the slope to the plateau? Exactly - our momentum. With backpropagation the momentum term [RHW86b] is responsible for the fact that a kind of moment of inertia (momentum) is added to every step size (fig. 5.13 on the next page), by always adding a fraction of the previous change to every new change in weight:

$$(\Delta_p w_{i,j})_{\text{now}} = \eta\, o_{p,i}\, \delta_{p,j} + \alpha \cdot (\Delta_p w_{i,j})_{\text{previous}}.$$


Of course, this notation is only used for 
a better understanding. Generally, as al¬ 
ready defined by the concept of time, when 
referring to the current cycle as (t), then 
the previous cycle is identified by (t — 1), 
which is continued successively. And now 
we come to the formal definition of the mo¬ 
mentum term: 

moment of inertia

Definition 5.12 (Momentum term). The variation of backpropagation by means of the momentum term is defined as follows:

$$\Delta w_{i,j}(t) = \eta\, o_i\, \delta_j + \alpha \cdot \Delta w_{i,j}(t-1) \tag{5.46}$$

We accelerate on plateaus (avoiding quasi¬ 
standstill on plateaus) and slow down on 
craggy surfaces (preventing oscillations). 
Moreover, the effect of inertia can be varied
via the prefactor $\alpha$; common values
are between 0.6 and 0.9. Additionally,
the momentum enables the positive
effect that our skier swings back and 
forth several times in a minimum, and fi¬ 
nally lands in the minimum. Despite its 
nice one-dimensional appearance, the oth¬ 
erwise very rare error of leaving good min¬ 
ima unfortunately occurs more frequently 
because of the momentum term - which 
means that this is again no optimal solu¬ 
tion (but we are by now accustomed to 
this condition). 
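The acceleration on plateaus can be made visible with a few lines. This is a sketch under assumptions: the function name and the constant values are illustrative, and the gradient term is held constant to imitate a plateau with a slight, uniform slope.

```python
def momentum_weight_change(eta, o_i, delta_j, alpha, previous_change):
    """Weight change with momentum term, following equation (5.46)."""
    return eta * o_i * delta_j + alpha * previous_change


# On a plateau the gradient term eta * o_i * delta_j stays (nearly) constant.
change = 0.0
history = []
for _ in range(200):
    change = momentum_weight_change(eta=0.1, o_i=1.0, delta_j=0.1,
                                    alpha=0.9, previous_change=change)
    history.append(change)
```

The first change equals the plain backpropagation step of 0.01, while the later changes approach 0.01 / (1 - 0.9) = 0.1, i.e. the step size grows roughly tenfold on the plateau.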


5.6.2 Flat spot elimination prevents 
neurons from getting stuck 

Figure 5.13: We want to execute the gradient descent like a skier crossing a slope, who would hardly stop immediately at the edge to the plateau.

It must be pointed out that with the hyperbolic tangent as well as with the Fermi function the derivative outside of the close proximity of 0 is nearly 0. This results in the fact that it becomes very difficult to move neurons away from the limits of the activation (flat spots), which could extremely extend the learning time. This problem can be dealt with by modifying the derivative, for example by adding a constant (e.g. 0.1), which is called flat spot elimination or - more colloquially - fudging.
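A minimal sketch of flat spot elimination for the Fermi function follows; the function names are assumptions, and the offset 0.1 is the example value from the text.

```python
import math


def fermi(x):
    """Fermi (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def fermi_derivative(x):
    """Derivative of the Fermi function, nearly 0 far away from 0."""
    y = fermi(x)
    return y * (1.0 - y)


def fermi_derivative_fudged(x, offset=0.1):
    """Flat spot elimination: add a constant to the derivative."""
    return fermi_derivative(x) + offset


plain_far_out = fermi_derivative(10.0)
fudged_far_out = fermi_derivative_fudged(10.0)
```

Far away from 0 the plain derivative almost vanishes, whereas the modified derivative never falls below the offset, so a saturated neuron still receives a learning impulse.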


neurons get stuck

It is an interesting observation that success has also been achieved by using derivatives defined as constants [Fah88]. A nice example making use of this effect is the fast hyperbolic tangent approximation by Anguita et al. introduced in section 3.2.6 on page 37. In the outer regions of its (as well approximated and accelerated) derivative, it makes use of a small constant.

5.6.3 The second derivative can be used, too

According to David Parker [Par87], second order backpropagation also uses the second gradient, i.e. the second multi-dimensional derivative of the error function, to obtain more precise estimates of the correct $\Delta w_{i,j}$. Even higher derivatives only rarely improve the estimations. Thus, fewer training cycles are needed but those require much more computational effort.

In general, we use further derivatives (i.e. Hessian matrices, since the functions are multidimensional) for higher order methods. As expected, the procedures reduce the number of learning epochs, but significantly increase the computational effort of the individual epochs. So in the end these procedures often need more learning time than backpropagation.

The quickpropagation learning procedure [Fah88] uses the second derivative of the error propagation and locally understands the error function to be a parabola. We analytically determine the vertex (i.e. the lowest point) of the said parabola and directly jump to this point. Thus, this learning procedure is a second-order procedure. Of course, this does not work with error surfaces that cannot locally be approximated by a parabola (certainly it is not always possible to directly say whether this is the case).

5.6.4 Weight decay: Punishment of large weights

The weight decay according to Paul Werbos [Wer88] is a modification that extends the error by a term punishing large weights. So the error under weight decay

$$\mathrm{Err}_{WD}$$

does not only increase proportionally to the actual error but also proportionally to the square of the weights. As a result the network is keeping the weights small during learning.

$$\mathrm{Err}_{WD} = \mathrm{Err} + \underbrace{\beta \cdot \frac{1}{2} \sum_{w \in W} (w)^2}_{\text{punishment}} \tag{5.47}$$

This approach is inspired by nature where synaptic weights cannot become infinitely strong as well. Additionally, due to these small weights, the error function often shows weaker fluctuations, allowing easier and more controlled learning.

The prefactor $\frac{1}{2}$ again resulted from simple pragmatics. The factor $\beta$ controls the strength of punishment: Values from 0.001 to 0.02 are often used here.
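The effect of the punishment term on the error and on its gradient can be sketched as follows. The function names and sample values are assumptions; the gradient rule simply follows from differentiating equation (5.47) with respect to each weight, which also shows why the prefactor 1/2 is convenient.

```python
def weight_decay_error(err, weights, beta):
    """Error under weight decay, following equation (5.47)."""
    return err + beta * 0.5 * sum(w * w for w in weights)


def weight_decay_gradient(grad_err, weights, beta):
    """Gradient of equation (5.47): every weight contributes beta * w.

    The prefactor 1/2 cancels against the exponent 2 of the squared weights.
    """
    return [g + beta * w for g, w in zip(grad_err, weights)]


weights = [1.0, -2.0]
err_wd = weight_decay_error(1.0, weights, beta=0.02)
grad_wd = weight_decay_gradient([0.0, 0.0], weights, beta=0.02)
```

Even where the actual error has a gradient of 0, the extra term pulls every weight towards 0, and more strongly so for large weights.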


5.6.5 Cutting networks down: 

Pruning and Optimal Brain 
Damage 

keep weights small

prune the network

If we have executed the weight decay long enough and notice that for a neuron in the input layer all successor weights are 0 or close to 0, we can remove the neuron, hence losing this neuron and some weights and thereby reduce the possibility that the network will memorize. This procedure is called pruning.
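The check described above is easy to sketch. The function name, the data layout (one list of successor weights per neuron) and the tolerance are assumptions made for this illustration.

```python
def prunable_neurons(successor_weights, tolerance=1e-3):
    """Return the indices of neurons whose successor weights are all 0 or close to 0.

    successor_weights[i] holds the weights of all connections leaving neuron i.
    """
    return [index
            for index, weights in enumerate(successor_weights)
            if all(abs(w) <= tolerance for w in weights)]


# Three neurons with two outgoing connections each (illustrative values).
weights_after_decay = [
    [0.5, -0.2],     # clearly needed
    [0.0, 0.0004],   # all successor weights close to 0
    [0.0, 0.3],      # one weight is still in use
]
candidates = prunable_neurons(weights_after_decay)
```

Only the second neuron qualifies, since a single relevant successor weight is enough to keep a neuron in the network.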

Such a method to detect and delete unnecessary
weights and neurons is referred
to as optimal brain damage [lCDS90].
I only want to describe it briefly: The 
mean error per output neuron is composed 
of two competing terms. While one term, 
as usual, considers the difference between 
output and teaching input, the other one 
tries to "press" a weight towards 0. If a 
weight is strongly needed to minimize the 
error, the first term will win. If this is not 
the case, the second term will win. Neu¬ 
rons which only have zero weights can be 
pruned again in the end. 

There are many other variations of back- 
prop and whole books only about this 
subject, but since my aim is to offer an 
overview of neural networks, I just want 
to mention the variations above as a moti¬ 
vation to read on. 

For some of these extensions it is obvi¬ 
ous that they cannot only be applied to 
feedforward networks with backpropaga- 
tion learning procedures. 

We have gotten to know backpropagation 
and feedforward topology - now we have 
to learn how to build a neural network. It 
is of course impossible to fully communi¬ 
cate this experience in the framework of 
this work. To obtain at least some of 
this knowledge, I now advise you to deal 
with some of the exemplary problems from
4.6.


5.7 Getting started - Initial 
configuration of a 
multilayer perceptron 

After having discussed the backpropaga¬ 
tion of error learning procedure and know¬ 
ing how to train an existing network, it 
would be useful to consider how to imple¬ 
ment such a network. 

5.7.1 Number of layers: Two or 
three may often do the job, 
but more are also used 

Let us begin with the trivial circumstance 
that a network should have one layer of in¬ 
put neurons and one layer of output neu¬ 
rons, which results in at least two layers. 

Additionally, we need - as we have already 
learned during the examination of linear 
separability — at least one hidden layer of 
neurons, if our problem is not linearly sep¬ 
arable (which is, as we have seen, very 
likely). 

It is possible, as already mentioned, to 
mathematically prove that this MLP with 
one hidden neuron layer is already capable 
of approximating arbitrary functions with 
any accuracy 5 - but it is necessary not 
only to discuss the representability of a 
problem by means of a perceptron but also 
the learnability. Representability means 
that a perceptron can, in principle, realize 

5 Note: We have not indicated the number of neu¬ 
rons in the hidden layer, we only mentioned the 
hypothetical possibility. 
a mapping - but learnability means that
we are also able to teach it. 

In this respect, experience shows that two 
hidden neuron layers (or three trainable 
weight layers) can be very useful to solve 
a problem, since many problems can be 
represented by a hidden layer but are very 
difficult to learn. 

One should keep in mind that any ad¬ 
ditional layer generates additional sub¬ 
minima of the error function in which we 
can get stuck. All these things consid¬ 
ered, a promising way is to try it with 
one hidden layer at first and if that fails, 
retry with two layers. Only if that fails, 
one should consider more layers. However, 
given the increasing calculation power of 
current computers, deep networks with 
a lot of layers are also used with success. 


5.7.2 The number of neurons has 
to be tested 


The number of neurons (apart from input 
and output layer, where the number of in¬ 
put and output neurons is already defined 
by the problem statement) principally cor¬ 
responds to the number of free parameters 
of the problem to be represented. 

Since we have already discussed the net¬ 
work capacity with respect to memorizing 
or a too imprecise problem representation, 
it is clear that our goal is to have as few 
free parameters as possible but as many as 
necessary. 

But we also know that there is no stan¬ 
dard solution for the question of how many 


neurons should be used. Thus, the most 
useful approach is to initially train with 
only a few neurons and to repeatedly train 
new networks with more neurons until the 
result significantly improves and, particu¬ 
larly, the generalization performance is not 
affected ( bottom-up approach). 


5.7.3 Selecting an activation 
function 

Another very important parameter for the 
way of information processing of a neural 
network is the selection of an activa¬ 
tion function. The activation function 
for input neurons is fixed to the identity 
function, since they do not process infor¬ 
mation. 

The first question to be asked is whether 
we actually want to use the same acti¬ 
vation function in the hidden layer and 
in the output layer - no one prevents us
from choosing different functions. Gener¬ 
ally, the activation function is the same for 
all hidden neurons as well as for the output 
neurons respectively. 

For tasks of function approximation it has been found reasonable to use the hyperbolic tangent (left part of fig. 5.14 on page 102) as activation function of the hidden neurons, while a linear activation function is used in the output. The latter is absolutely necessary so that we do not generate a limited output interval. Contrary to the input layer which uses linear activation functions as well, the output layer still processes information, because it has threshold values. However, linear activation functions in the output can also cause huge learning steps and jumping over good minima in the error surface. This can be avoided by setting the learning rate to very small values in the output layer.

An unlimited output interval is not essential for pattern recognition tasks^6. If the hyperbolic tangent is used in any case, the output interval will be a bit larger. Unlike with the hyperbolic tangent, with the Fermi function (right part of fig. 5.14 on the following page) it is difficult to learn something far from the threshold value (where its result is close to 0). However, here a lot of freedom is given for selecting an activation function. But generally, the disadvantage of sigmoid functions is the fact that they hardly learn something for values far from their threshold value, unless the network is modified.

6 Generally, pattern recognition is understood as a special case of function approximation with a few discrete output possibilities.

D. Kriesel - A Brief Introduction to Neural Networks (ZETA2-EN)
dkriesel.com

5.7.4 Weights should be initialized with small, randomly chosen values

random initial weights

The initialization of weights is not as trivial as one might think. If they are simply initialized with 0, there will be no change in weights at all. If they are all initialized by the same value, they will all change equally during training. The simple solution of this problem is called symmetry breaking, which is the initialization of weights with small random values. The range of random values could be the interval $[-0.5; 0.5]$ not including 0 or values very close to 0. This random initialization has a nice side effect: Chances are that the average of network inputs is close to 0, a value that hits (in most activation functions) the region of the greatest derivative, allowing for strong learning impulses right from the start of learning.

SNIPE: In Snipe, weights are initialized randomly (if a synapse initialization is wanted). The maximum absolute weight value of a synapse initialized at random can be set in a NeuralNetworkDescriptor using the method setSynapseInitialRange.

5.8 The 8-3-8 encoding problem and related problems


The 8-3-8 encoding problem is a classic among the multilayer perceptron test training problems. In our MLP we have an input layer with eight neurons $i_1, i_2, \ldots, i_8$, an output layer with eight neurons $\Omega_1, \Omega_2, \ldots, \Omega_8$ and one hidden layer with three neurons. Thus, this network represents a function $\mathbb{B}^8 \to \mathbb{B}^8$. Now the training task is that an input of a value 1 into the neuron $i_j$ should lead to an output of a value 1 from the neuron $\Omega_j$ (only one neuron should be activated, which results in 8 training samples).
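The training set of such an encoder problem can be generated mechanically. The following sketch is independent of Snipe; the function name and the representation of a sample as a pair of lists are assumptions.

```python
def encoder_samples(n):
    """Create the n training samples of an n-h-n encoder problem.

    Sample j activates only input neuron j (value 1, all others 0) and
    expects exactly the same pattern at the output neurons.
    """
    samples = []
    for j in range(n):
        pattern = [1.0 if i == j else 0.0 for i in range(n)]
        samples.append((pattern, list(pattern)))
    return samples


samples = encoder_samples(8)
third_input, third_teaching_input = samples[2]
```

For the 8-3-8 problem this yields eight samples; the number of hidden neurons does not influence the training data, only the network that has to learn it.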

During the analysis of the trained network we will see that the network with the 3 hidden neurons represents some kind of binary encoding and that the above mapping is possible (assumed training time: $\approx 10^4$ epochs). Thus, our network is a machine in which the input is first encoded and afterwards decoded again.

Figure 5.14 (left panel: Hyperbolic Tangent; right panel: Fermi Function with Temperature Parameter; horizontal axis: x): As a reminder the illustration of the hyperbolic tangent (left) and the Fermi function (right). The Fermi function was expanded by a temperature parameter. The original Fermi function is thereby represented by dark colors, the temperature parameters of the modified Fermi functions are, ordered ascending by steepness, 1/2, 1/5, 1/10 and 1/25.

Analogously, we can train a 1024-10-1024 
encoding problem. But is it possible to 
improve the efficiency of this procedure? 
Could there be, for example, a 1024-9- 
1024- or an 8-2-8-encoding network? 


Yes, even that is possible, since the network does not depend on binary encodings: Thus, an 8-2-8 network is sufficient for our problem. But the encoding of the network is far more difficult to understand (fig. 5.15 on the next page) and the training of the networks requires a lot more time.


SNIPE: The static method getEncoderSampleLesson in the class TrainingSampleLesson allows for creating simple training sample lessons of arbitrary dimensionality for encoder problems like the above.

An 8-1-8 network, however, does not work, 
since the possibility that the output of one 
neuron is compensated by another one is 
essential, and if there is only one hidden 
neuron, there is certainly no compensatory 
neuron. 


Exercises 


Exercise 8. Fig. 5.4 on page 75 shows a small network for the boolean functions AND and OR. Write tables with all computational parameters of neural networks (e.g. network input, activation etc.). Perform the calculations for the four possible inputs of the networks and write down the values of these variables for each input. Do the same for the XOR network (fig. 5.9 on page 84).



Figure 5.15: Illustration of the functionality of 
8-2-8 network encoding. The marked points rep¬ 
resent the vectors of the inner neuron activation 
associated to the samples. As you can see, it 
is possible to find inner activation formations so 
that each point can be separated from the rest 
of the points by a straight line. The illustration 
shows an exemplary separation of one point. 


Exercise 9. 

1. List all boolean functions $\mathbb{B}^3 \to \mathbb{B}^1$,
that are linearly separable and char¬ 
acterize them exactly. 

2. List those that are not linearly sepa¬ 
rable and characterize them exactly, 
too. 

Exercise 10. A simple 2-1 network shall be trained with one single pattern by means of backpropagation of error and $\eta = 0.1$. Verify if the error

$$\mathrm{Err} = \mathrm{Err}_p = \frac{1}{2}(t - y)^2$$

converges and if so, at what value. What does the error curve look like? Let the pattern $(p, t)$ be defined by $p = (p_1, p_2) = (0.3, 0.7)$ and $t_\Omega = 0.4$. Randomly initialize the weights in the interval $[1; -1]$.

Exercise 11. A one-stage perceptron 
with two input neurons, bias neuron 
and binary threshold function as activa¬ 
tion function divides the two-dimensional 
space into two regions by means of a 
straight line g. Analytically calculate a 
set of weight values for such a perceptron 
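so that the following set $P$ of the 6 patterns of the form $(p_1, p_2, t_\Omega)$ with $\varepsilon \ll 1$ is correctly classified.

$$\begin{aligned}
P = \{ & (0, 0, -1); \\
& (2, -1, 1); \\
& (7 + \varepsilon, 3 - \varepsilon, 1); \\
& (7 - \varepsilon, 3 + \varepsilon, -1); \\
& (0, -2 - \varepsilon, 1); \\
& (0 - \varepsilon, -2, -1) \}
\end{aligned}$$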
Exercise 12. Calculate in a comprehensible way one vector $\Delta W$ of all changes in weight by means of the backpropagation of error procedure with $\eta = 1$. Let a 2-2-1 MLP with bias neuron be given and let the pattern be defined by

$$p = (p_1, p_2, t_\Omega) = (2, 0, 0.1).$$

For all weights with the target $\Omega$ the initial value of the weights should be 1. For all other weights the initial value should be 0.5. What is conspicuous about the changes?
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Radial basis functions 


RBF networks approximate functions by stretching and compressing Gaussian 
bells and then summing them spatially shifted. Description of their functions 
and their learning process. Comparison with multilayer perceptrons. 


According to Poggio and Girosi [PG89], radial basis function networks (RBF networks) are a paradigm of neural networks which was developed considerably later than that of perceptrons. Like perceptrons, the RBF networks are built in layers. But in this case, they have exactly three layers, i.e. only one single layer of hidden neurons.

Like perceptrons, the networks have a feedforward structure and their layers are completely linked. Here, the input layer again does not participate in information processing. The RBF networks are - like MLPs - universal function approximators.

Despite all things in common: What is the difference between RBF networks and perceptrons? The difference lies in the information processing itself and in the computational rules within the neurons outside of the input layer. So, in a moment we will define a so far unknown type of neurons.


6.1 Components and 
structure of an RBF 
network 

Initially, we want to discuss some concepts concerning RBF networks colloquially and then define them.

Output neurons: In an RBF network the output neurons only contain the identity as activation function and one weighted sum as propagation function. Thus, they do little more than adding all input values and returning the sum.

Hidden neurons are also called RBF neurons (and the layer in which they are located is referred to as RBF layer). As propagation function, each hidden neuron calculates a norm that represents the distance between the input to the network and the so-called position of the neuron (center). This is inserted into a radial activation function which calculates and outputs the activation of the neuron.


Definition 6.1 (RBF input neuron). Definition and representation are identical to definition 5.1 on page 73 of the input neuron.


Definition 6.2 (Center of an RBF neuron). The center $c_h$ of an RBF neuron $h$ is the point in the input space where the RBF neuron is located. In general, the closer the input vector is to the center vector of an RBF neuron, the higher is its activation.


Definition 6.3 (RBF neuron). The so-called RBF neurons $h$ have a propagation function $f_{\mathrm{prop}}$ that determines the distance between the center $c_h$ of a neuron and the input vector $y$. This distance represents the network input. Then the network input is sent through a radial basis function $f_{\mathrm{act}}$ which returns the activation or the output of the neuron. RBF neurons are represented by the symbol (symbol omitted).



Definition 6.4 (RBF output neuron). RBF output neurons $\Omega$ use the weighted sum as propagation function $f_{\mathrm{prop}}$, and the identity as activation function $f_{\mathrm{act}}$. They are represented by the symbol (symbol omitted).



Definition 6.5 (RBF network). An RBF network has exactly three layers in the following order: The input layer consisting of input neurons, the hidden layer (also called RBF layer) consisting of RBF neurons and the output layer consisting of RBF output neurons. Each layer is completely linked with the following one, shortcuts do not exist (fig. 6.1 on the next page) - it is a feedforward topology. The connections between input layer and RBF layer are unweighted, i.e. they only transmit the input. The connections between RBF layer and output layer are weighted. The original definition of an RBF network only referred to one output neuron, but - in analogy to the perceptrons - it is apparent that such a definition can be generalized. A bias neuron is not used in RBF networks. The set of input neurons shall be represented by $I$, the set of hidden neurons by $H$ and the set of output neurons by $O$.


Therefore, the inner neurons are called radial basis neurons because it follows directly from their definition that all input vectors with the same distance from the center of a neuron also produce the same output value (fig. 6.2 on page 108).


6.2 Information processing of 
an RBF network 


Now the question is what can be realized by such a network and what its purpose is. Let us go over the RBF network from top to bottom: An RBF network receives the input by means of the unweighted connections. Then the input vector is sent through a norm so that the result is a scalar. This scalar (which, by the way, cannot be negative due to the norm) is processed by a radial basis function, for example by a Gaussian bell (fig. 6.3 on the next page).

The output values of the different neurons of the RBF layer or of the different Gaussian bells are added within the third layer: basically, in relation to the whole input space, Gaussian bells are added here.

Suppose that we have a second, a third and a fourth RBF neuron and therefore four differently located centers. Each of these neurons now measures another distance from the input to its own center and de facto provides different values, even if the Gaussian bell is the same. Since these values are finally simply accumulated in the output layer, one can easily see that any surface can be shaped by dragging, compressing and removing Gaussian bells and subsequently accumulating them. Here, the parameters for the superposition of the Gaussian bells are in the weights of the connections between the RBF layer and the output layer.

Furthermore, the network architecture offers the possibility to freely define or train height and width of the Gaussian bells - due to which the network paradigm becomes even more versatile. We will get to know methods and approaches for this later.

Figure 6.1: An exemplary RBF network with two input neurons, five hidden neurons and three output neurons. The connections to the hidden neurons are not weighted, they only transmit the input. Right of the illustration you can find the names of the neurons, which coincide with the names of the MLP neurons: Input neurons are called $i$, hidden neurons are called $h$ and output neurons are called $\Omega$. The associated sets are referred to as $I$, $H$ and $O$.

Figure 6.2: Let $c_h$ be the center of an RBF neuron $h$. Then the activation function $f_{\mathrm{act}_h}$ is radially symmetric around $c_h$.

6.2.1 Information processing in RBF neurons

RBF neurons process information by using norms and radial basis functions.


At first, let us take as an example a simple 1-4-1 RBF network. It is apparent that we will receive a one-dimensional output which can be represented as a function (fig. 6.4 on the facing page). Additionally, the network includes the centers $c_1, c_2, \ldots, c_4$ of the four inner neurons $h_1, h_2, \ldots, h_4$, and therefore it has Gaussian bells which are finally added within the output neuron $\Omega$. The network also possesses four values $\sigma_1, \sigma_2, \ldots, \sigma_4$ which influence the width of the Gaussian bells. In contrast, the height of the Gaussian bell is influenced by the subsequent weights, since the individual output values of the bells are multiplied by those weights.
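As a minimal sketch of the 1-4-1 network just described, the following Python snippet adds four Gaussian bells in one output neuron. The centers and widths are the ones listed for fig. 6.4; the weights (heights) are not given in the text and are assumed here purely for illustration, as are the function names `gaussian` and `rbf_1d`.

```python
import math


def gaussian(r, sigma):
    """Gaussian bell without normalization factor: exp(-r^2 / (2 * sigma^2))."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def rbf_1d(x, centers, sigmas, weights):
    """Output of a 1-|H|-1 RBF network: weighted sum of the bells' activations."""
    return sum(
        w * gaussian(abs(x - c), s)  # |x - c| is the distance to the center
        for c, s, w in zip(centers, sigmas, weights)
    )


centers = [0.0, 1.0, 3.0, 4.0]  # c_1 ... c_4 as listed for fig. 6.4
sigmas = [0.4, 1.0, 0.2, 0.8]   # sigma_1 ... sigma_4 as listed for fig. 6.4
weights = [0.6, 0.4, 0.8, 0.5]  # assumed heights, not taken from the text

for x in (0.0, 1.0, 3.0, 4.0, 10.0):
    print(f"x = {x:4.1f}  ->  y = {rbf_1d(x, centers, sigmas, weights):.4f}")
```

Changing an entry of `sigmas` widens or narrows one bell, while changing an entry of `weights` only scales its height.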


Figure 6.3: Two individual Gaussian bells, one in one-dimensional space (panel "Gaussian in 1D", $h(r)$ plotted over $r$) and one in two-dimensional space (panel "Gaussian in 2D"). In both cases $\sigma = 0.4$ holds and the centers of the Gaussian bells lie in the coordinate origin. The distance $r$ to the center $(0, 0)$ is simply calculated according to the Pythagorean theorem: $r = \sqrt{x^2 + y^2}$.



Figure 6.4: Four different Gaussian bells in one-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. The Gaussian bells have different heights, widths and positions. Their centers $c_1, c_2, \ldots, c_4$ are located at 0, 1, 3, 4, the widths $\sigma_1, \sigma_2, \ldots, \sigma_4$ at 0.4, 1, 0.2, 0.8. You can see a two-dimensional example in fig. 6.5 on the following page.

Figure 6.5: Four different Gaussian bells in two-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. Once again $r = \sqrt{x^2 + y^2}$ applies for the distance. The heights $w$, widths $\sigma$ and centers $c = (x, y)$ are: $w_1 = 1$, $\sigma_1 = 0.4$, $c_1 = (0.5, 0.5)$, $w_2 = -1$, $\sigma_2 = 0.6$, $c_2 = (1.15, -1.15)$, $w_3 = 1.5$, $\sigma_3 = 0.2$, $c_3 = (-0.5, -1)$, $w_4 = 0.8$, $\sigma_4 = 1.4$, $c_4 = (-2, 0)$.



Since we use a norm to calculate the distance between the input vector and the center of a neuron $h$, we have different choices: Often the Euclidean norm is chosen to calculate the distance:

$r_h = \|x - c_h\|$    (6.1)

$\phantom{r_h} = \sqrt{\sum_{i \in I} (x_i - c_{h,i})^2}$    (6.2)

Remember: The input vector was referred to as $x$. Here, the index $i$ runs through the input neurons and thereby through the input vector components and the neuron center components. As we can see, the Euclidean distance generates the squared differences of all vector components, adds them and extracts the root of the sum. In two-dimensional space this corresponds to the Pythagorean theorem. From the definition of a norm it follows directly that the distance cannot be negative. Strictly speaking, we hence only use the non-negative part of the activation function. By the way, activation functions other than the Gaussian bell are possible. Normally, functions that are monotonically decreasing over the interval $[0; \infty]$ are chosen.

Now that we know the distance $r_h$ between the input vector $x$ and the center $c_h$ of the RBF neuron $h$, this distance has to be passed through the activation function. Here we use, as already mentioned, a Gaussian bell:

$f_{\mathrm{act}}(r_h) = e^{-r_h^2 / (2\sigma_h^2)}$    (6.3)

It is obvious that both the center $c_h$ and the width $\sigma_h$ can be seen as part of the activation function $f_{\mathrm{act}}$, and hence the activation functions should not all be referred to as $f_{\mathrm{act}}$ simultaneously. One solution would be to number the activation functions like $f_{\mathrm{act}_1}, f_{\mathrm{act}_2}, \ldots, f_{\mathrm{act}_{|H|}}$ with $H$ being the set of hidden neurons. But as a result the explanation would be very confusing. So I simply use the name $f_{\mathrm{act}}$ for all activation functions and regard $\sigma$ and $c$ as variables that are defined for individual neurons but not directly included in the activation function.

The reader will certainly notice that in the literature the Gaussian bell is often normalized by a multiplicative factor. We can, however, avoid this factor because we are multiplying anyway with the subsequent weights, and consecutive multiplications, first by a normalization factor and then by the connections' weights, would only yield different factors there. We do not need this factor (especially because for our purpose the integral of the Gaussian bell need not always be 1) and therefore simply leave it out.

6.2.2 Some analytical thoughts prior to the training

The output $y_\Omega$ of an RBF output neuron $\Omega$ results from combining the functions of the RBF neurons to

$y_\Omega = \sum_{h \in H} w_{h,\Omega} \cdot f_{\mathrm{act}}(\|x - c_h\|)$.    (6.4)

Suppose that similar to the multilayer perceptron we have a set $P$ that contains $|P|$ training samples $(p, t)$. Then we obtain $|P|$ functions of the form

$y_\Omega = \sum_{h \in H} w_{h,\Omega} \cdot f_{\mathrm{act}}(\|p - c_h\|)$,    (6.5)

i.e. one function for each training sample.

Of course, with this effort we are aiming at letting the output $y$ for all training patterns $p$ converge to the corresponding teaching input $t$.
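Equations 6.1 to 6.4 can be tried out directly. The sketch below evaluates one output neuron for a two-dimensional input, using the heights, widths and centers from the caption of fig. 6.5 as far as they can be read there. The function names and the tuple layout `(w, sigma, c)` are assumptions made for this illustration only.

```python
import math


def distance(x, c):
    """Euclidean distance r_h = ||x - c_h|| (equations 6.1 and 6.2)."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def f_act(r, sigma):
    """Gaussian bell of equation 6.3, without normalization factor."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def output(x, neurons):
    """y_Omega as in equation 6.4; neurons is a list of (w, sigma, c) tuples."""
    return sum(w * f_act(distance(x, c), sigma) for w, sigma, c in neurons)


# (w_h, sigma_h, c_h) as read from the caption of fig. 6.5
neurons = [
    (1.0, 0.4, (0.5, 0.5)),
    (-1.0, 0.6, (1.15, -1.15)),
    (1.5, 0.2, (-0.5, -1.0)),
    (0.8, 1.4, (-2.0, 0.0)),
]

print(output((0.5, 0.5), neurons))    # evaluated at the first center
print(output((50.0, 50.0), neurons))  # far away from all centers
```

Far away from all centers every bell contributes practically nothing, so the second value is numerically zero.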


6.2.2.1 Weights can simply be computed as solution of a system of equations

Thus, we have $|P|$ equations. Now let us assume that the widths $\sigma_1, \sigma_2, \ldots, \sigma_k$, the centers $c_1, c_2, \ldots, c_k$ and the training samples $p$ including the teaching input $t$ are given. We are looking for the weights $w_{h,\Omega}$ with $|H|$ weights for one output neuron $\Omega$. Thus, our problem can be seen as a system of equations since the only thing we want to change at the moment are the weights.

This demands a distinction of cases concerning the number of training samples $|P|$ and the number of RBF neurons $|H|$:


$|P| = |H|$: If the number of RBF neurons equals the number of patterns, i.e. $|P| = |H|$, the equation can be reduced to a matrix multiplication

$T = M \cdot G$    (6.6)
$\Leftrightarrow M^{-1} \cdot T = M^{-1} \cdot M \cdot G$    (6.7)
$\Leftrightarrow M^{-1} \cdot T = E \cdot G$    (6.8)
$\Leftrightarrow M^{-1} \cdot T = G,$    (6.9)

where

> $T$ is the vector of the teaching inputs for all training samples,

> $M$ is the $|P| \times |H|$ matrix of the outputs of all $|H|$ RBF neurons to $|P|$ samples (remember: $|P| = |H|$, the matrix is square and we can therefore attempt to invert it),

> $G$ is the vector of the desired weights and

> $E$ is the $|H| \times |H|$ unit matrix, whose size matches the length of $G$.

Mathematically speaking, we can simply calculate the weights: In the case of $|P| = |H|$ there is exactly one RBF neuron available per training sample. This means that the network exactly meets the $|P|$ existing nodes after having calculated the weights, i.e. it performs a precise interpolation. To calculate such an equation we certainly do not need an RBF network, and therefore we can proceed to the next case.

Exact interpolation must not be mistaken for the memorizing ability mentioned with the MLPs: First, we are not talking about the training of RBF networks at the moment. Second, it could be advantageous for us and might in fact be intended if the network exactly interpolates between the nodes.
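The case $|P| = |H|$ can be illustrated with a small sketch in which one RBF neuron sits on every training pattern and the weights are obtained by solving $T = M \cdot G$. The training set, the common width `sigma` and the helper `solve` (plain Gaussian elimination with partial pivoting) are assumptions for this example and not part of the original text.

```python
import math


def gaussian(r, sigma):
    """Gaussian bell without normalization factor."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def solve(a, b):
    """Solve a * g = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    g = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][k] * g[k] for k in range(i + 1, n))
        g[i] = (aug[i][n] - rest) / aug[i][i]
    return g


# Hypothetical one-dimensional training set; |P| = |H| = 3.
patterns = [0.0, 1.0, 2.0]
targets = [0.5, -1.0, 2.0]
centers = patterns[:]  # one RBF neuron per training pattern
sigma = 0.5            # assumed common width

# M[p][h] is the output of RBF neuron h for training pattern p.
M = [[gaussian(abs(p - c), sigma) for c in centers] for p in patterns]
G = solve(M, targets)  # same result as M^-1 * T, without forming M^-1


def network(x):
    """Output of the resulting 1-3-1 RBF network."""
    return sum(w * gaussian(abs(x - c), sigma) for w, c in zip(G, centers))


for p, t in zip(patterns, targets):
    print(p, t, network(p))
```

Solving the system directly instead of forming the inverse explicitly is a design choice of this sketch; both express equation 6.9.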

$|P| < |H|$: The system of equations is under-determined: there are more RBF neurons than training samples, i.e. $|P| < |H|$. Certainly, this case does not occur very often. In this case, there is a huge variety of solutions which we do not need in such detail. We can select one set of weights out of many obviously possible ones.

$|P| > |H|$: But most interesting for further discussion is the case in which there are significantly more training samples than RBF neurons, that means $|P| > |H|$. Thus, we again want to use the generalization capability of the neural network.

If we have more training samples than RBF neurons, we cannot assume that every training sample is exactly hit. So, if we cannot exactly hit the points and therefore cannot just interpolate as in the aforementioned ideal case with $|P| = |H|$, we must try to find a function that approximates our training set $P$ as closely as possible: As with the MLP we try to reduce the sum of the squared error to a minimum.

How do we continue the calculation in the case of $|P| > |H|$? As above, to solve the system of equations, we have to find the solution $G$ of a matrix multiplication

$T = M \cdot G$.    (6.10)

The problem is that this time we cannot invert the $|P| \times |H|$ matrix $M$ because it is not a square matrix (here, $|P| \neq |H|$ is true). Here, we have to use the Moore-Penrose pseudo inverse $M^+$ which is defined by

$M^+ = (M^T \cdot M)^{-1} \cdot M^T$    (6.11)

Although the Moore-Penrose pseudo inverse is not the inverse of a matrix, it can be used similarly in this case¹. We get equations that are very similar to those in the case of $|P| = |H|$:

$T = M \cdot G$    (6.12)
$\Leftrightarrow M^+ \cdot T = M^+ \cdot M \cdot G$    (6.13)
$\Leftrightarrow M^+ \cdot T = E \cdot G$    (6.14)
$\Leftrightarrow M^+ \cdot T = G$    (6.15)

Another reason for the use of the Moore-Penrose pseudo inverse is the fact that it minimizes the squared error (which is our goal): The estimate of the vector $G$ in equation 6.15 corresponds to the Gauss-Markov model known from statistics, which is used to minimize the squared error.

In the aforementioned equations 6.11 and the following ones please do not mistake the $T$ in $M^T$ (of the transpose of the matrix $M$) for the $T$ of the vector of all teaching inputs.

¹ Particularly, $M^+ = M^{-1}$ is true if $M$ is invertible. I do not want to go into detail of the reasons for these circumstances and applications of $M^+$ - they can easily be found in literature for linear algebra.
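For the case $|P| > |H|$, the following sketch computes $G = M^+ \cdot T$ with $M^+ = (M^T \cdot M)^{-1} \cdot M^T$ by solving the equivalent system $(M^T M)\,G = M^T T$. All data, the width `sigma` and the helper names are hypothetical. Forming $M^T M$ explicitly keeps the example short, but it is known to be numerically less robust than dedicated pseudo-inverse routines.

```python
import math


def gaussian(r, sigma):
    """Gaussian bell without normalization factor."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def solve(a, b):
    """Solve a * g = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    g = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][k] * g[k] for k in range(i + 1, n))
        g[i] = (aug[i][n] - rest) / aug[i][i]
    return g


def least_squares_weights(M, T):
    """G = M^+ T with M^+ = (M^T M)^-1 M^T, computed via (M^T M) G = M^T T."""
    h = len(M[0])
    mtm = [[sum(row[i] * row[j] for row in M) for j in range(h)] for i in range(h)]
    mtt = [sum(row[i] * t for row, t in zip(M, T)) for i in range(h)]
    return solve(mtm, mtt)


# Hypothetical setup: |P| = 5 training samples, |H| = 2 RBF neurons.
patterns = [-1.0, 0.0, 1.0, 2.0, 3.0]
T = [0.1, 0.9, 1.2, 0.8, 0.3]
centers = [0.0, 2.0]
sigma = 1.0

M = [[gaussian(abs(p - c), sigma) for c in centers] for p in patterns]
G = least_squares_weights(M, T)

outputs = [sum(m * g for m, g in zip(row, G)) for row in M]
print("weights:", G)
print("squared error:", sum((t - y) ** 2 for t, y in zip(T, outputs)))
```

The training samples are no longer hit exactly; the remaining error is the smallest one achievable with the given centers and widths.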


6.2.2.2 The generalization to several outputs is trivial and not very computationally expensive

We have found a mathematically exact way to directly calculate the weights. What will happen if there are several output neurons, i.e. $|O| > 1$, with $O$ being, as usual, the set of the output neurons $\Omega$? In this case, as we have already indicated, it does not change much: The additional output neurons have their own set of weights while we do not change the $\sigma$ and $c$ of the RBF layer. Thus, in an RBF network it is easy for given $\sigma$ and $c$ to realize a lot of output neurons since we only have to calculate the individual vector of weights

$G_\Omega = M^+ \cdot T_\Omega$    (6.16)

for every new output neuron $\Omega$, whereas the matrix $M^+$, which generally requires a lot of computational effort, always stays the same: So it is quite inexpensive - at least concerning the computational complexity - to add more output neurons.
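Equation 6.16 can be sketched as follows: $M^+$ is computed once and then reused for every output neuron. To keep the example short it is restricted to $|H| = 2$, so that $(M^T M)^{-1}$ has a closed form. The data, the teaching vectors and the names `pseudo_inverse_2` and `weights_for` are assumptions for this illustration.

```python
import math


def gaussian(r, sigma):
    """Gaussian bell without normalization factor."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def pseudo_inverse_2(M):
    """M^+ = (M^T M)^-1 M^T for a |P| x 2 matrix M (closed-form 2x2 inverse)."""
    a = sum(row[0] * row[0] for row in M)
    b = sum(row[0] * row[1] for row in M)
    d = sum(row[1] * row[1] for row in M)
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    # Result has 2 rows and |P| columns.
    return [[inv[i][0] * row[0] + inv[i][1] * row[1] for row in M] for i in range(2)]


def weights_for(M_plus, T):
    """G_Omega = M^+ * T_Omega for one output neuron (equation 6.16)."""
    return [sum(m * t for m, t in zip(row, T)) for row in M_plus]


patterns = [-1.0, 0.0, 1.0, 2.0, 3.0]
centers = [0.0, 2.0]
sigma = 1.0
M = [[gaussian(abs(p - c), sigma) for c in centers] for p in patterns]

# Hypothetical teaching vectors for two output neurons, generated from known
# weights so that the result can be checked.
true_weights = {"omega_1": [1.0, -2.0], "omega_2": [0.5, 0.25]}
teaching = {
    name: [sum(m * w for m, w in zip(row, ws)) for row in M]
    for name, ws in true_weights.items()
}

M_plus = pseudo_inverse_2(M)  # expensive part, done only once
G = {name: weights_for(M_plus, T) for name, T in teaching.items()}
print(G)
```

Each additional output neuron only costs one matrix-vector product with the already available `M_plus`.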

6.2.2.3 Computational effort and accuracy

For realistic problems it normally applies that there are considerably more training samples than RBF neurons, i.e. $|P| \gg |H|$: You can, without any difficulty, use $10^6$ training samples, if you like. Theoretically, we could find the terms for the mathematically correct solution on the blackboard (after a very long time), but such calculations often seem to be imprecise and very time-consuming (matrix inversions require a lot of computational effort).

Furthermore, our Moore-Penrose pseudo inverse is, in spite of numeric stability, no guarantee that the output vector corresponds to the teaching vector, because such extensive computations can be prone to many inaccuracies, even though the calculation is mathematically correct: Our computers can only provide us with (nonetheless very good) approximations of the pseudo-inverse matrices. This means that we also get only approximations of the correct weights (maybe with a lot of accumulated numerical errors) and therefore only an approximation (maybe very rough or even unrecognizable) of the desired output.

If we have enough computing power to analytically determine a weight vector, we should use it nevertheless only as an initial value for our learning process, which leads us to the real training methods - but otherwise it would be boring, wouldn't it?

6.3 Combinations of equation system and gradient strategies are useful for training

Analogous to the MLP we perform a gradient descent to find the suitable weights by means of the already well known delta rule. Here, backpropagation is unnecessary since we only have to train one single weight layer - which requires less computing time.

We know that the delta rule is

$\Delta w_{h,\Omega} = \eta \cdot \delta_\Omega \cdot o_h$,    (6.17)

in which we now insert as follows:

$\Delta w_{h,\Omega} = \eta \cdot (t_\Omega - y_\Omega) \cdot f_{\mathrm{act}}(\|p - c_h\|)$    (6.18)

Here again I explicitly want to mention that it is very popular to divide the training into two phases by analytically computing a set of weights and then refining it by training with the delta rule.
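A minimal sketch of online training with the delta rule of equation 6.18 for a single output neuron might look as follows. The training data, the learning rate `eta`, the number of epochs and the initialization with zeros are assumptions for this example; in the two-phase scheme described above one would start from analytically computed weights instead.

```python
import math


def gaussian(r, sigma):
    """Gaussian bell without normalization factor."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def train_delta_rule(patterns, targets, centers, sigma, eta=0.1, epochs=500):
    """Online delta rule (equation 6.18) for the weights of one output neuron."""
    weights = [0.0] * len(centers)  # assumed start; could be the analytic solution
    for _ in range(epochs):
        for p, t in zip(patterns, targets):
            o = [gaussian(abs(p - c), sigma) for c in centers]  # outputs o_h
            y = sum(w * o_h for w, o_h in zip(weights, o))
            delta = t - y  # delta_Omega for an output neuron with identity
            weights = [w + eta * delta * o_h for w, o_h in zip(weights, o)]
    return weights


# Hypothetical data generated from known weights, so that the result can be checked.
patterns = [-1.0, 0.0, 1.0, 2.0, 3.0]
centers = [0.0, 2.0]
sigma = 1.0
true_weights = [1.0, -2.0]
targets = [
    sum(w * gaussian(abs(p - c), sigma) for w, c in zip(true_weights, centers))
    for p in patterns
]

learned = train_delta_rule(patterns, targets, centers, sigma)
print(learned)
```

Only the weights between the RBF layer and the output neuron change; centers and widths stay fixed throughout.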

There is still the question whether to learn offline or online. Here, the answer is similar to the answer for the multilayer perceptron: Initially, one often trains online (faster movement across the error surface). Then, after having approximated the solution, the errors are once again accumulated and, for a more precise approximation, one trains offline in a third learning phase. However, similar to the MLPs, you can be successful by using many methods.

As already indicated, in an RBF network not only the weights between the hidden and the output layer can be optimized. So let us now take a look at the possibility to vary $\sigma$ and $c$.


6.3.1 It is not always trivial to 
determine centers and widths 
of RBF neurons 

It is obvious that the approximation accuracy of RBF networks can be increased by adapting the widths and positions of the Gaussian bells in the input space to the problem that needs to be approximated. There are several methods to deal with the centers $c$ and the widths $\sigma$ of the Gaussian bells:

Fixed selection: The centers and widths can be selected in a fixed manner and regardless of the training samples - this is what we have assumed until now.

Conditional, fixed selection: Again centers and widths are selected fixedly, but we have previous knowledge about the functions to be approximated and comply with it.

Adaptive to the learning process: This is definitely the most elegant variant, but certainly the most challenging one, too. A realization of this approach will not be discussed in this chapter but it can be found in connection with another network topology (section 10.6.1).

6.3.1.1 Fixed selection 

In any case, the goal is to cover the input space as evenly as possible. Here, widths of 2/3 of the distance between the centers can be selected so that the Gaussian bells overlap by approx. "one third"² (fig. 6.6). The closer the bells are set the more precise but the more time-consuming the whole thing becomes.

This may seem to be very inelegant, but in the field of function approximation we cannot avoid even coverage. Here it is useless if the function to be approximated is precisely represented at some positions but at other positions the return value is only 0. However, a high input dimension requires a great many RBF neurons, which increases the computational effort exponentially with the dimension - and is responsible for the fact that six- to ten-dimensional problems in RBF networks are already called "high-dimensional" (an MLP, for example, does not cause any problems here).

² It is apparent that a Gaussian bell is mathematically infinitely wide, therefore I ask the reader to excuse this sloppy formulation.

Figure 6.6: Example for an even coverage of a two-dimensional input space by applying radial basis functions.

6.3.1.2 Conditional, fixed selection

Suppose that our training samples are not evenly distributed across the input space. It then seems obvious to arrange the centers and sigmas of the RBF neurons by means of the pattern distribution. So the training patterns can be analyzed by statistical techniques such as a cluster analysis, and so it can be determined whether there are statistical factors according to which we should distribute the centers and sigmas (fig. 6.7 on the facing page).

A more trivial alternative would be to set $|H|$ centers on positions randomly selected from the set of patterns. So this method would allow for every training pattern $p$ to be directly in the center of a neuron (fig. 6.8 on the next page). This is not yet very elegant but a good solution when time is an issue. Generally, for this method the widths are fixedly selected.
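The random placement just described takes only a few lines. In the sketch below, the training patterns, the number of neurons, the common width and the function name `pick_centers` are all hypothetical; the fixed seed only serves to make the example reproducible.

```python
import random


def pick_centers(patterns, num_neurons, sigma, seed=0):
    """Place |H| centers on randomly chosen training patterns, with one fixed width."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    centers = rng.sample(patterns, num_neurons)  # no pattern is chosen twice
    sigmas = [sigma] * num_neurons               # widths are fixedly selected
    return centers, sigmas


# Hypothetical two-dimensional training patterns.
patterns = [(0.1, 0.2), (0.4, 0.9), (0.5, 0.5), (0.8, 0.1), (0.9, 0.7), (0.3, 0.6)]
centers, sigmas = pick_centers(patterns, num_neurons=3, sigma=0.25)
print(centers, sigmas)
```

Because the centers are drawn from the patterns themselves, densely sampled regions of the input space tend to receive more neurons than sparse ones.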

If we have reason to believe that the set of training samples is clustered, we can use clustering methods to determine the centers. There are different methods to determine clusters in an arbitrarily dimensional set of points. We will be introduced to some of them in excursus A. The so-called ROLFs are one neural clustering method (section A.5), and self-organizing maps are also useful in connection with determining the position of RBF neurons (section 10.6.1). Using ROLFs, one can also receive indicators for useful radii of the RBF neurons. Learning vector quantisation (chapter 9) has also provided good results. All these methods have nothing to do with the RBF networks themselves but are only used to generate some previous knowledge. Therefore we will not discuss them in this chapter but independently in the indicated chapters.

Figure 6.7: Example of an uneven coverage of a two-dimensional input space, of which we have previous knowledge, by applying radial basis functions.

Figure 6.8: Example of an uneven coverage of a two-dimensional input space by applying radial basis functions. The widths were fixedly selected, the centers of the neurons were randomly distributed throughout the training patterns. This distribution can certainly lead to slightly unrepresentative results, which can be seen at the single data point down to the left.

Another approach is to use the proven methods: We could slightly move the positions of the centers and observe how our error function Err is changing - a gradient descent, as already known from the MLPs. In a similar manner we could look at how the error depends on the values $\sigma$. Analogous to the derivation of backpropagation we derive

$\frac{\partial \mathrm{Err}(\sigma_h c_h)}{\partial \sigma_h}$ and $\frac{\partial \mathrm{Err}(\sigma_h c_h)}{\partial c_h}$.

Since the derivation of these terms corresponds to the derivation of backpropagation we do not want to discuss it here.
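For orientation, the two derivatives can be written out under two assumptions that are not spelled out at this point of the text: the specific error is $\mathrm{Err}_p = \frac{1}{2}\sum_{\Omega \in O}(t_\Omega - y_\Omega)^2$, and the activation function is the Gaussian bell of equation 6.3 with $r_h = \|p - c_h\|$. This is a sketch of the result obtained with the chain rule, not a full derivation:

```latex
\frac{\partial \mathrm{Err}_p}{\partial \sigma_h}
  = -\sum_{\Omega \in O} (t_\Omega - y_\Omega)\, w_{h,\Omega}\,
    f_{\mathrm{act}}(r_h)\, \frac{r_h^2}{\sigma_h^3}
\qquad
\frac{\partial \mathrm{Err}_p}{\partial c_h}
  = -\sum_{\Omega \in O} (t_\Omega - y_\Omega)\, w_{h,\Omega}\,
    f_{\mathrm{act}}(r_h)\, \frac{p - c_h}{\sigma_h^2}
```

A gradient descent step would then change $\sigma_h$ and $c_h$ by $-\eta$ times the respective derivative.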

But experience shows that no convincing results are obtained by regarding how the error behaves depending on the centers and sigmas. Even if mathematics claims that such methods are promising, the gradient descent, as we already know, leads to problems with very craggy error surfaces.

And that is the crucial point: Naturally, RBF networks generate very craggy error surfaces because, if we considerably change a $c$ or a $\sigma$, we will significantly change the appearance of the error function.

6.4 Growing RBF networks 
automatically adjust the 
neuron density 

In growing RBF networks, the number $|H|$ of RBF neurons is not constant. A certain number $|H|$ of neurons as well as their centers $c_h$ and widths $\sigma_h$ are previously selected (e.g. by means of a clustering method) and then extended or reduced. In the following text, only simple mechanisms are sketched. For more information, I refer to [Fri94].

6.4.1 Neurons are added to places 
with large error values 

After generating this initial configuration the vector of the weights $G$ is analytically calculated. Then all specific errors $\mathrm{Err}_p$ concerning the set $P$ of the training samples are calculated and the maximum specific error

$\max_P(\mathrm{Err}_p)$

is sought.

The extension of the network is simple: We replace this maximum error with a new RBF neuron, i.e. we place a new neuron at the position of the maximum error. Of course, we have to exercise care in doing this: If the $\sigma$ are small, the neurons will only influence each other if the distance between them is short. But if the $\sigma$ are large, the already existing neurons are considerably influenced by the new neuron because of the overlapping of the Gaussian bells.

So it is obvious that we will adjust the already existing RBF neurons when adding the new neuron.

To put it simply, this adjustment is made by moving the centers $c$ of the other neurons away from the new neuron and reducing their width $\sigma$ a bit. Then the current output vector $y$ of the network is compared to the teaching input $t$ and the weight vector $G$ is improved by means of training. Subsequently, a new neuron can be inserted if necessary. This method is particularly suited for function approximations.
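One growth step can be sketched as follows. The sketch only covers the search for the maximum specific error and the insertion of a neuron at that position; moving and narrowing the existing neurons and retraining the weights are left out. The choice of the new weight (closing the remaining gap at the new center), the width `new_sigma` and all data are assumptions for this illustration and not the procedure of [Fri94].

```python
import math


def gaussian(r, sigma):
    """Gaussian bell without normalization factor."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def output(x, neurons):
    """Network output; neurons is a list of (weight, sigma, center), 1-D input."""
    return sum(w * gaussian(abs(x - c), s) for w, s, c in neurons)


def grow_once(neurons, samples, new_sigma):
    """Insert one RBF neuron at the training sample with the largest specific error."""
    # Err_p = 1/2 (t - y)^2 for every training sample (p, t)
    errors = [0.5 * (t - output(p, neurons)) ** 2 for p, t in samples]
    worst = max(range(len(samples)), key=lambda i: errors[i])
    p, t = samples[worst]
    # Assumption: the new weight closes the remaining gap at p exactly.
    new_weight = t - output(p, neurons)
    return neurons + [(new_weight, new_sigma, p)]


# Hypothetical initial configuration and training samples (p, t).
neurons = [(1.0, 0.5, 1.0)]
samples = [(0.0, 0.2), (1.0, 1.0), (2.0, -0.5), (3.0, 0.1)]

grown = grow_once(neurons, samples, new_sigma=0.3)
print(grown[-1])               # the inserted neuron
print(output(2.0, grown))      # output at the former maximum error position
```

With a small `new_sigma` the inserted bell hardly disturbs the other samples, which mirrors the remark above about the influence of small and large widths.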

6.4.2 Limiting the number of 
neurons 

Here it is essential to make sure that the network does not grow ad infinitum, which can happen very fast. Thus, it is very useful to previously define a maximum number of neurons $|H|_{\max}$.

6.4.3 Less important neurons are 
deleted 

This leads to the question whether it is possible to continue learning when this limit $|H|_{\max}$ is reached. The answer is: this would not stop learning. We only have to look for the "most unimportant" neuron and delete it. A neuron is, for example, unimportant for the network if there is another neuron that has a similar function: It often occurs that two Gaussian bells exactly overlap and at such a position, for instance, one single neuron with a higher Gaussian bell would be appropriate.

But developing automated procedures to find less relevant neurons is highly problem dependent and we want to leave this to the programmer.

With RBF networks and multilayer perceptrons we have already become acquainted with and extensively discussed two network paradigms for similar problems. Therefore we want to compare these two paradigms and look at their advantages and disadvantages.

6.5 Comparing RBF networks 
and multilayer 
perceptrons 

We will compare multilayer perceptrons 
and RBF networks with respect to differ¬ 
ent aspects. 

Input dimension: We must be careful with RBF networks in high-dimensional functional spaces since the network could very quickly require huge memory storage and computational effort. Here, a multilayer perceptron would cause fewer problems because its number of neurons does not grow exponentially with the input dimension.

Center selection: However, selecting the centers $c$ for RBF networks is (despite the introduced approaches) still a major problem. Please use any previous knowledge you have when applying them. Such problems do not occur with the MLP.

Output dimension: The advantage of RBF networks is that the training is not much influenced when the output dimension of the network is high. For an MLP, a learning procedure such as backpropagation thereby will be very time-consuming.

Extrapolation: Advantage as well as disadvantage of RBF networks is the lack of extrapolation capability: An RBF network returns the result 0 far away from the centers of the RBF layer. On the one hand, unlike the MLP, it cannot be used for extrapolation (although we could never know whether the extrapolated values of the MLP are reasonable, experience shows that MLPs are suitable for that matter). On the other hand, unlike the MLP the network is capable of using this 0 to tell us "I don't know", which could be an advantage.

Lesion tolerance: For the output of an MLP, it is not so important if a weight or a neuron is missing. It will only worsen a little in total. If a weight or a neuron is missing in an RBF network then large parts of the output remain practically uninfluenced. But one part of the output is heavily affected because a Gaussian bell is directly missing. Thus, we can choose between a strong local error for lesion and a weak but global error.

Spread: Here the MLP is "advantaged" since RBF networks are used considerably less often - which is not always understood by professionals (at least as far as low-dimensional input spaces are concerned). The MLPs seem to have a considerably longer tradition and they are working too well to take the effort to read some pages of this work about RBF networks :-).

Exercises

Exercise 13. An $|I|$-$|H|$-$|O|$ RBF network with fixed widths and centers of the neurons should approximate a target function $u$. For this, $|P|$ training samples of the form $(p, t)$ of the function $u$ are given. Let $|P| > |H|$ be true. The weights should be analytically determined by means of the Moore-Penrose pseudo inverse. Indicate the running time behavior regarding $|P|$ and $|O|$ as precisely as possible.

Note: There are methods for matrix multiplications and matrix inversions that are more efficient than the canonical methods. For better estimations, I recommend looking for such methods (and their complexity). In addition to your complexity calculations, please indicate the used methods together with their complexity.
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Chapter 7 

Recurrent perceptron-like networks 


Some thoughts about networks with internal states. 


more capable 
than MLP 


Generally, recurrent networks are networks that are capable of
influencing themselves by means of recurrences, e.g. by including
the network output in the following computation steps. There are
many types of recurrent networks of nearly arbitrary form, and
nearly all of them are referred to as recurrent neural networks.
As a result, for the few paradigms introduced here I use the name
recurrent multilayer perceptrons.

Apparently, such a recurrent network is capable of computing more
than the ordinary MLP: If the recurrent weights are set to 0, the
recurrent network is reduced to an ordinary MLP. Additionally, the
recurrence generates different network-internal states, so that
even identical inputs can produce different outputs depending on
the network state.
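
As a small illustration of this argument, the following sketch uses a single tanh neuron with one self-recurrent connection. This toy model, the function name `run` and all weight values are assumptions made for the example and are not one of the paradigms introduced in this chapter.

```python
import math

def run(w_in, w_rec, inputs):
    """Single tanh neuron whose previous output is fed back via w_rec."""
    y = 0.0  # initial internal state
    outputs = []
    for x in inputs:
        y = math.tanh(w_in * x + w_rec * y)
        outputs.append(y)
    return outputs

constant_input = [1.0, 1.0, 1.0]

# Recurrent weight 0: the neuron behaves like an ordinary feedforward unit.
feedforward = run(w_in=0.5, w_rec=0.0, inputs=constant_input)

# Recurrent weight != 0: the same input yields different outputs over time.
recurrent = run(w_in=0.5, w_rec=0.9, inputs=constant_input)

print(feedforward)
print(recurrent)
```

With the recurrent weight set to 0 all three outputs are identical, whereas with a nonzero recurrent weight the internal state changes the response to the same input.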

Recurrent networks in themselves have a 
great dynamic that is mathematically dif¬ 
ficult to conceive and has to be discussed 
extensively. The aim of this chapter is 
only to briefly discuss how recurrences can 


be structured and how network-internal 
states can be generated. Thus, I will 
briefly introduce two paradigms of recur¬ 
rent networks and afterwards roughly out¬ 
line their training. 


With a recurrent network an input x that 
is constant over time may lead to differ¬ 
ent results: On the one hand, the network 
could converge, i.e. it could transform it¬ 
self into a fixed state and at some time re¬ 
turn a fixed output value y. On the other 
hand, it could never converge, or at least 
not until a long time later, so that it can 
no longer be recognized, and as a conse¬ 
quence, y constantly changes. 
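
Both cases can be observed even in a very small example. The sketch below iterates a single self-recurrent tanh neuron with a constant input; the model, the function name `iterate`, the tolerance and the weight values are hypothetical choices for illustration only.

```python
import math

def iterate(w_rec, x, steps=200, tol=1e-9):
    """Iterate y(t+1) = tanh(w_rec * y(t) + x) for a constant input x.

    Returns (converged, last_value).
    """
    y = 0.1  # arbitrary starting state
    for _ in range(steps):
        y_next = math.tanh(w_rec * y + x)
        if abs(y_next - y) < tol:
            return True, y_next
        y = y_next
    return False, y

# Weak recurrence: the state settles into a fixed value.
print(iterate(w_rec=0.5, x=0.3))

# Strong negative recurrence: the output keeps flipping its sign.
print(iterate(w_rec=-3.0, x=0.0))
```

In the second call the output alternates between two values of opposite sign, which is a very simple instance of a periodic, non-converging behavior.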


If the network does not converge, it is, for example, possible to
check if periodic sequences or attractors (fig. 7.1) are returned.
Here, we can expect the complete variety of dynamical systems.
That is the reason why I particularly want to refer to the
literature concerning dynamical systems.

Further discussions could reveal what will happen if the input of
recurrent networks is changed.

In this chapter the related paradigms of recurrent networks
according to Jordan and Elman will be introduced.



Figure 7.1: The Roessler attractor 


7.1 Jordan networks 


A Jordan network [Jor86] is a multilayer perceptron with a set $K$
of so-called context neurons $k_1, k_2, \ldots, k_{|K|}$. There is
one context neuron per output neuron (fig. 7.2). In principle, a
context neuron just memorizes an output until it can be processed
in the next time step. Therefore, there are weighted connections
between each output neuron and one context neuron. The stored
values are returned to the actual network by means of complete
links between the context neurons and the input layer.


output 
neurons 
are buffered 


In the original definition of a Jordan network the context neurons
are also recurrent to themselves via a connecting weight $\lambda$.
But most applications omit this recurrence since the Jordan network
is already very dynamic and difficult to analyze, even without
these additional recurrences.


Definition 7.1 (Context neuron). A context neuron $k$ receives the
output value of another neuron $i$ at a time $t$ and then reenters
it into the network at a time $(t + 1)$.

Figure 7.2: Illustration of a Jordan network. The network output is buffered in the context neurons
and with the next time step it is entered into the network together with the new input.

Definition 7.2 (Jordan network). A Jordan network is a multilayer
perceptron with one context neuron per output neuron. The set of
context neurons is called $K$. The context neurons are completely
linked toward the input layer of the network.
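
A single time step of such a network can be sketched as follows. This is only an illustration under several assumptions that are not part of the definition: the context values are treated like additional inputs of the first processing layer, each output is copied into its context neuron with weight 1, the self-recurrence of the context neurons is omitted, tanh is used as activation function, and all names, sizes and weights are hypothetical.

```python
import math

def layer(inputs, weights):
    """Fully connected tanh layer; weights[j][i] links input i to neuron j."""
    return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]

def jordan_step(x, context, w_hidden, w_out):
    """One time step; returns (output, new context values)."""
    hidden = layer(x + context, w_hidden)  # external input plus buffered output
    output = layer(hidden, w_out)
    return output, list(output)            # one context neuron per output neuron

# Hypothetical sizes: 2 inputs, 2 hidden neurons, 1 output, hence 1 context neuron.
w_hidden = [[0.5, -0.4, 0.8], [0.3, 0.7, -0.6]]
w_out = [[1.0, -1.0]]

context = [0.0]  # the context neurons need an initial value
outputs = []
for x in ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]):
    y, context = jordan_step(x, context, w_hidden, w_out)
    outputs.append(y[0])

print(outputs)
```

Although the external input stays the same, the outputs differ from step to step because the buffered output of the previous step is entered together with the new input.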


7.2 Elman networks


The Elman networks (a variation of the Jordan networks) [Elm90]
have context neurons, too, but one layer of context neurons per
information processing neuron layer (fig. 7.3). Thus, the outputs
of each hidden neuron or output neuron are led into the associated
context layer (again exactly one context neuron per neuron) and
from there they are reentered into the complete neuron layer
during the next time step (i.e. again a complete link on the way
back). So the complete information processing part¹ of the MLP
exists a second time as a "context version" - which once again
considerably increases dynamics and state variety.

Compared with Jordan networks, the Elman networks often have the
advantage of acting more purposefully since every layer can access
its own context.

¹ Remember: The input layer does not process information.

Figure 7.3: Illustration of an Elman network. The entire information processing part of the network
exists, in a way, twice. The output of each neuron (except for the output of the input neurons)
is buffered and reentered into the associated layer. For the reason of clarity I named the context
neurons on the basis of their models in the actual network, but it is not mandatory to do so.

Definition 7.3 (Elman network). An Elman network is an MLP with
one context neuron per information processing neuron. The set of
context neurons is called $K$. This means that there exists one
context layer per information processing neuron layer with exactly
the same number of context neurons. Every neuron has a weighted
connection to exactly one context neuron while the context layer
is completely linked towards its original layer.
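
The definition can be sketched in code as follows. As before, this is a hedged illustration: the tanh activation, the copy weight of 1 from each neuron to its context neuron, and all names, sizes and weights are assumptions for the example.

```python
import math

def elman_layer(inputs, context, w_in, w_ctx):
    """tanh layer that sees its regular inputs plus its own context layer."""
    out = []
    for row_in, row_ctx in zip(w_in, w_ctx):
        net = sum(w * v for w, v in zip(row_in, inputs))
        net += sum(w * c for w, c in zip(row_ctx, context))
        out.append(math.tanh(net))
    return out

def elman_step(x, ctx_hidden, ctx_out, params):
    """One time step; returns (output, new hidden context, new output context)."""
    hidden = elman_layer(x, ctx_hidden, params["w_xh"], params["w_ch"])
    output = elman_layer(hidden, ctx_out, params["w_hy"], params["w_cy"])
    # Each context neuron buffers the output of exactly one neuron.
    return output, list(hidden), list(output)

params = {
    "w_xh": [[0.6], [-0.4]],            # input -> hidden (2 hidden neurons)
    "w_ch": [[0.2, -0.5], [0.7, 0.1]],  # hidden context -> hidden (complete link)
    "w_hy": [[0.9, -0.3]],              # hidden -> output
    "w_cy": [[0.5]],                    # output context -> output
}

ctx_h, ctx_y = [0.0, 0.0], [0.0]
ys = []
for x in ([1.0], [1.0], [1.0]):
    y, ctx_h, ctx_y = elman_step(x, ctx_h, ctx_y, params)
    ys.append(y[0])

print(ys)
```

In contrast to the Jordan sketch, every information processing layer here has its own context layer of the same size.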


Now it is interesting to take a look at the 
training of recurrent networks since, for in¬ 
stance, ordinary backpropagation of error 
cannot work on recurrent networks. Once 
again, the style of the following part is 
rather informal, which means that I will 
not use any formal definitions. 


7.3 Training recurrent 
networks 

In order to explain the training as comprehensibly as possible, we
have to agree on some simplifications that do not affect the
learning principle itself.

So for the training let us assume that in the beginning the
context neurons are initialized with an input, since otherwise
they would have an undefined input (this is no simplification but
reality).

Furthermore, we use a Jordan network without a hidden neuron layer
for our training attempts so that the output neurons can directly
provide input. This approach is a strong simplification because
generally more complicated networks are used. But this does not
change the learning principle.


7.3.1 Unfolding in time 

Remember our actual learning procedure for MLPs, the
backpropagation of error, which backpropagates the delta values.
So, in case of recurrent networks the delta values would
backpropagate cyclically through the network again and again,
which makes the training more difficult. On the one hand we cannot
know which of the many generated delta values for a weight should
be selected for training, i.e. which values are useful. On the
other hand we cannot definitely know when learning should be
stopped. The advantage of recurrent networks is the great state
dynamics within the network; the disadvantage is that these
dynamics also affect the training and therefore make it difficult.

One learning approach would be the attempt to unfold the temporal
states of the network (fig. 7.4): Recursions are deleted by
putting a similar network above the context neurons, i.e. the
context neurons are, in a manner of speaking, the output neurons
of the attached network. More generally speaking, we have to
backtrack the recurrences and place "earlier" instances of neurons
in the network - thus creating a larger, but forward-oriented
network without recurrences. This enables training a recurrent
network with any training strategy developed for non-recurrent
ones. Here, the input is entered as teaching input into every
"copy" of the input neurons. This can be done for a discrete
number of time steps. These training paradigms are called
unfolding in time [MP69]. After the unfolding a training by means
of backpropagation of error is possible.


But obviously, for one weight $w_{i,j}$ several changing values
$\Delta w_{i,j}$ are received, which can be treated differently:
accumulation, averaging etc. A simple accumulation could possibly
result in enormous changes per weight if all changes have the same
sign. Hence, also the average is not to be underestimated. We
could also introduce a discounting factor, which weakens the
influence of $\Delta w_{i,j}$ of the past.
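
The three ways of treating the changing values can be compared in a few lines. The sketch assumes that the weight changes of the individual network copies are already given as a list with the most recent time step first; the function name `combine`, the discount factor 0.5 and the sample values are hypothetical.

```python
def combine(deltas, mode="sum", discount=0.5):
    """Combine per-copy weight changes; deltas[0] is the most recent time step."""
    if mode == "sum":
        return sum(deltas)
    if mode == "mean":
        return sum(deltas) / len(deltas)
    if mode == "discounted":
        # Older changes (larger age) are weakened by discount ** age.
        return sum(d * discount ** age for age, d in enumerate(deltas))
    raise ValueError(mode)

# Four network copies that all propose the same change with the same sign.
deltas = [0.25, 0.25, 0.25, 0.25]

print(combine(deltas, "sum"))         # accumulation grows with the number of copies
print(combine(deltas, "mean"))        # averaging stays at the size of one change
print(combine(deltas, "discounted"))  # the past contributes less and less
```

With many unfolded time steps the plain sum can become very large, which is the effect described above.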

Unfolding in time is particularly useful if we have the impression
that the recent past is more important for the network than the
more distant past. The reason for this is that backpropagation has
only little influence in the layers farther away from the output
(remember: the farther we are from the output layer, the smaller
the influence of backpropagation).

Disadvantages: the training of such an unfolded network will take
a long time since a large number of layers could possibly be
produced. A problem that is no longer negligible is the limited
computational accuracy of ordinary computers, which is exhausted
very fast because of so many nested computations (the farther we
are from the output layer, the smaller the influence of
backpropagation, so that this limit is reached). Furthermore, with
several levels of context neurons this procedure could produce
very large networks to be trained.

Figure 7.4: Illustration of the unfolding in time with a small exemplary recurrent MLP. Top: The
recurrent MLP. Bottom: The unfolded network. For reasons of clarity, I only added names to
the lowest part of the unfolded network. Dotted arrows leading into the network mark the inputs.
Dotted arrows leading out of the network mark the outputs. Each "network copy" represents a time
step of the network with the most recent time step being at the bottom.

D. Kriesel - A Brief Introduction to Neural Networks (ZETA2-EN)


7.3.2 Teacher forcing

Other procedures are teacher forcing and open loop learning, which
are equivalent. They detach the recurrence during the learning
process: We simply pretend that the recurrence does not exist and
apply the teaching input to the context neurons during the
training. So, backpropagation becomes possible, too. Disadvantage:
with Elman networks a teaching input for non-output-neurons is not
given.


7.3.3 Recurrent backpropagation

Another popular procedure without limited time horizon is the
recurrent backpropagation using methods of differential calculus
to solve the problem [Pin87].

7.3.4 Training with evolution

Due to the already long lasting training time, evolutionary
algorithms have proved to be of value, especially with recurrent
networks. One reason for this is that they are not only
unrestricted with respect to recurrences but they also have other
advantages when the mutation mechanisms are chosen suitably: So,
for example, neurons and weights can be adjusted and the network
topology can be optimized (of course the result of learning is not
necessarily a Jordan or Elman network). With ordinary MLPs,
however, evolutionary strategies are less popular since they
certainly need a lot more time than a directed learning procedure
such as backpropagation.


Chapter 8

Hopfield networks 


In a magnetic field, each particle applies a force to any other particle so that 
all particles adjust their movements in the energetically most favorable way. 
This natural mechanism is copied to adjust noisy inputs in order to match 

their real models. 


Another supervised learning example of the wide range of neural
networks was developed by John Hopfield: the so-called Hopfield
networks [Hop82]. Hopfield and his physically motivated networks
have contributed a lot to the renaissance of neural networks.


8.1 Hopfield networks are 
inspired by particles in a 
magnetic field 


The idea for the Hopfield networks originated from the behavior of
particles in a magnetic field: Every particle "communicates" (by
means of magnetic forces) with every other particle (completely
linked), with each particle trying to reach an energetically
favorable state (i.e. a minimum of the energy function). As for
the neurons, this state is known as activation. Thus, all
particles or neurons rotate and thereby encourage each other to
continue this rotation. In a manner of speaking, our neural
network is a cloud of particles.

Based on the fact that the particles automatically detect the
minima of the energy function, Hopfield had the idea to use the
"spin" of the particles to process information: Why not let the
particles search for minima on arbitrary functions? Even if we
only use two of those spins, i.e. a binary activation, we will
recognize that the developed Hopfield network shows considerable
dynamics.


8.2 In a Hopfield network, all
neurons influence each 
other symmetrically 


Briefly speaking, a Hopfield network consists of a set $K$ of
completely linked neurons with binary activation (since we only
use two spins), with the weights being symmetric between the
individual neurons and without any neuron being directly connected
to itself (fig. 8.1). Thus, the state of $|K|$ neurons with two
possible states $\in \{-1, 1\}$ can be described by a string
$x \in \{-1, 1\}^{|K|}$.

The complete link provides a full square matrix of weights between
the neurons. The meaning of the weights will be discussed in the
following. Furthermore, we will soon recognize according to which
rules the neurons are spinning, i.e. are changing their state.

Additionally, the complete link leads to the fact that we do not
know any input, output or hidden neurons. Thus, we have to think
about how we can input something into the $|K|$ neurons.

Figure 8.1: Illustration of an exemplary Hopfield network. The
arrows ↑ and ↓ mark the binary "spin". Due to the completely
linked neurons the layers cannot be separated, which means that a
Hopfield network simply includes a set of neurons.

Definition 8.1 (Hopfield network). A Hopfield network consists of
a set $K$ of completely linked neurons without direct recurrences.
The activation function of the neurons is the binary threshold
function with outputs $\in \{1, -1\}$.

Definition 8.2 (State of a Hopfield network). The state of the
network consists of the activation states of all neurons. Thus,
the state of the network can be understood as a binary string
$z \in \{-1, 1\}^{|K|}$.

8.2.1 Input and output of a
Hopfield network are
represented by neuron states


We have learned that a network, i.e. a set of $|K|$ particles, that
is in a state is automatically looking for a minimum. An input
pattern of a Hopfield network is exactly such a state: A binary
string $x \in \{-1, 1\}^{|K|}$ that initializes the neurons. Then
the network is looking for the minimum to be taken (which we have
previously defined by the input of training samples) on its energy
surface.

But when do we know that the minimum has been found? This is
simple, too: when the network stops. It can be proven that a
Hopfield network with a symmetric weight matrix that has zeros on
its diagonal always converges [CG88], i.e. at some point it will
stand still. Then the output is a binary string
$y \in \{-1, 1\}^{|K|}$, namely the state string of the network
that has found a minimum.


Now let us take a closer look at the contents of the weight matrix
and the rules for the state change of the neurons.

Definition 8.3 (Input and output of a Hopfield network). The input
of a Hopfield network is a binary string $x \in \{-1, 1\}^{|K|}$
that initializes the state of the network. After the convergence
of the network, the output is the binary string
$y \in \{-1, 1\}^{|K|}$ generated from the new network state.


8.2.2 Significance of weights

We have already said that the neurons change their states, i.e.
their direction, from $-1$ to $1$ or vice versa. These spins occur
dependent on the current states of the other neurons and the
associated weights. Thus, the weights are capable of controlling
the complete change of the network. The weights can be positive,
negative, or 0. Colloquially speaking, for a weight $w_{i,j}$
between two neurons $i$ and $j$ the following holds:

If $w_{i,j}$ is positive, it will try to force the two neurons to
    become equal - the larger the weight, the harder the network
    will try. If the neuron $i$ is in state $1$ and the neuron $j$
    is in state $-1$, a high positive weight will advise the two
    neurons that it is energetically more favorable to be equal.

If $w_{i,j}$ is negative, its behavior will be analogous, only that
    $i$ and $j$ are urged to be different. A neuron $i$ in state
    $-1$ would try to urge a neuron $j$ into state $1$.

Zero weights lead to the two involved neurons not influencing each
    other.

The weights as a whole apparently take the way from the current
state of the network towards the next minimum of the energy
function. We now want to discuss how the neurons follow this way.

8.2.3 A neuron changes its state
according to the influence of
the other neurons

Once a network has been trained and initialized with some starting
state, the change of state $x_k$ of the individual neurons $k$
occurs according to the scheme

$$x_k(t) = f_{\text{act}}\left( \sum_{j \in K} w_{j,k} \cdot x_j(t-1) \right) \tag{8.1}$$

in each time step, where the function $f_{\text{act}}$ generally is
the binary threshold function (fig. 8.2) with threshold 0.
Colloquially speaking: a neuron $k$ calculates the sum of
$w_{j,k} \cdot x_j(t-1)$, which indicates how strongly and in which
direction the neuron $k$ is forced by the other neurons $j$. Thus,
the new state of the network (time $t$) results from the state of
the network at the previous time $t-1$. This sum is the direction
into which the neuron $k$ is pushed. Depending on the sign of the
sum the neuron takes state $1$ or $-1$.


Another difference between Hopfield networks and other already
known network topologies is the asynchronous update: A neuron $k$
is randomly chosen every time, which then recalculates the
activation. Thus, the new activation of the previously changed
neurons immediately influences the network, i.e. one time step
indicates the change of a single neuron.

Figure 8.2: Illustration of the binary threshold function
(Heaviside function).

Regardless of the aforementioned random selection of the neuron, a
Hopfield network is often implemented in a much simpler way: The
neurons are simply processed one after the other and their
activations are recalculated until no more changes occur.
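
This simple implementation can be sketched as follows. The sketch assumes that a sum of exactly 0 is mapped to state 1 (the text only states that the threshold is 0), and the names `sign` and `recall`, the sweep limit and the example weight matrix are hypothetical.

```python
def sign(value):
    """Binary threshold with threshold 0; a sum of exactly 0 maps to 1 (assumption)."""
    return 1 if value >= 0 else -1

def recall(weights, state, max_sweeps=100):
    """Process the neurons one after the other until a full sweep changes nothing."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for k in range(n):
            net = sum(weights[j][k] * state[j] for j in range(n))
            new = sign(net)
            if new != state[k]:
                state[k] = new   # takes effect immediately for the next neuron
                changed = True
        if not changed:
            break
    return state

# Hypothetical symmetric weight matrix with zeros on the diagonal.
weights = [
    [0, 1, -1],
    [1, 0, -1],
    [-1, -1, 0],
]

result = recall(weights, [-1, 1, -1])
print(result)
```

Because each changed activation is written back into the state immediately, the following neurons already see the new value, which corresponds to the asynchronous update described above.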

Definition 8.4 (Change in the state of a Hopfield network). The
change of state of the neurons occurs asynchronously with the
neuron to be updated being randomly chosen and the new state being
generated by means of this rule:

$$x_k(t) = f_{\text{act}}\left( \sum_{j \in K} w_{j,k} \cdot x_j(t-1) \right).$$

Now that we know how the weights influence the changes in the
states of the neurons and force the entire network towards a
minimum, there is the question of how to teach the weights to
force the network towards a certain minimum.

8.3 The weight matrix is
generated directly out of
the training patterns

The aim is to generate minima on the mentioned energy surface, so
that at an input the network can converge to them. As with many
other network paradigms, we use a set $P$ of training patterns
$p \in \{1, -1\}^{|K|}$, representing the minima of our energy
surface.

Unlike many other network paradigms, we do not look for the minima
of an unknown error function but define minima on such a function.
The purpose is that the network shall automatically take the
closest minimum when the input is presented. For now this seems
unusual, but we will understand the whole purpose later.

Roughly speaking, the training of a Hopfield network is done by
training each training pattern exactly once using the rule
described in the following (Single Shot Learning), where $p_i$ and
$p_j$ are the states of the neurons $i$ and $j$ under $p \in P$:



$$w_{i,j} = \sum_{p \in P} p_i \cdot p_j \tag{8.2}$$

This results in the weight matrix $W$. Colloquially speaking: We
initialize the network by means of a training pattern and then
process the weights $w_{i,j}$ one after another. For each of these
weights we verify: Are the neurons $i$, $j$ in the same state or do
the states vary? In the first case we add $1$ to the weight, in the
second case we add $-1$.

This we repeat for each training pattern $p \in P$. Finally, the
values of the weights $w_{i,j}$ are high when $i$ and $j$
corresponded with many training patterns. Colloquially speaking,
this high value tells the neurons: "Often, it is energetically
favorable to hold the same state". The same applies to negative
weights.

Due to this training we can store a certain fixed number of
patterns $p$ in the weight matrix. At an input $x$ the network will
converge to the stored pattern that is closest to the input $x$.
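
The single shot learning described above can be written down directly. In this sketch the function name `train` and the two training patterns are hypothetical; the diagonal is kept at zero, as required for Hopfield networks.

```python
def train(patterns):
    """Single shot learning: w[i][j] is the sum over all patterns of p_i * p_j."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:               # no neuron is connected to itself
                    w[i][j] += p[i] * p[j]
    return w

# Two hypothetical training patterns with four neurons each.
patterns = [
    [1, 1, -1, -1],
    [1, -1, 1, -1],
]

w = train(patterns)
for row in w:
    print(row)
```

Neurons that hold the same state in both patterns receive a positive weight, neurons that always differ receive a negative weight, and contradictory patterns cancel each other to 0.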


Unfortunately, the number of the maximum storable and
reconstructible patterns $p$ is limited to

$$|P|_{\text{MAX}} \approx 0.139 \cdot |K|, \tag{8.3}$$

which in turn only applies to orthogonal patterns. This was shown
by precise (and time-consuming) mathematical analyses, which we do
not want to specify now. If more patterns are entered, already
stored information will be destroyed.

Definition 8.5 (Learning rule for Hopfield networks). The
individual elements of the weight matrix $W$ are defined by a
single processing of the learning rule

$$w_{i,j} = \sum_{p \in P} p_i \cdot p_j,$$

where the diagonal of the matrix is covered with zeros. Here, no
more than $|P|_{\text{MAX}} \approx 0.139 \cdot |K|$ training
samples can be trained and at the same time maintain their
function.

Now we know the functionality of Hopfield networks but nothing
about their practical use.

8.4 Autoassociation and
traditional application

Hopfield networks, like those mentioned above, are called
autoassociators. An autoassociator $a$ exactly shows the
aforementioned behavior: Firstly, when a known pattern $p$ is
entered, exactly this known pattern is returned. Thus,

$$a(p) = p,$$

with $a$ being the associative mapping. Secondly, and that is the
practical use, this also works with inputs that are close to a
pattern:

$$a(p + \varepsilon) = p.$$

Afterwards, the autoassociator is, in any case, in a stable state,
namely in the state $p$.

If the set of patterns $P$ consists of, for example, letters or
other characters in the form of pixels, the network will be able
to correctly recognize deformed or noisy letters with high
probability (fig. 8.3).

The primary fields of application of Hopfield networks are pattern
recognition and pattern completion, such as the zip code
recognition on letters in the eighties. But soon the Hopfield
networks were replaced by other systems in most of their fields of
application, for example by OCR systems in the field of letter
recognition. Today Hopfield networks are virtually no longer used;
they have not become established in practice.

Figure 8.3: Illustration of the convergence of an exemplary
Hopfield network. Each of the pictures has 10 x 12 = 120 binary
pixels. In the Hopfield network each pixel corresponds to one
neuron. The upper illustration shows the training samples, the
lower shows the convergence of a heavily noisy 3 to the
corresponding training sample.


8.5 Heteroassociation and 
analogies to neural data 
storage 

So far we have been introduced to Hopfield 
networks that converge from an arbitrary 
input into the closest minimum of a static 
energy surface. 

Another variant is a dynamic energy surface: Here, the appearance
of the energy surface depends on the current state and we receive
a heteroassociator instead of an autoassociator. For a
heteroassociator

$$a(p + \varepsilon) = p$$

is no longer true, but rather

$$h(p + \varepsilon) = q,$$

which means that a pattern is mapped onto another one. $h$ is the
heteroassociative mapping. Such heteroassociations are achieved by
means of an asymmetric weight matrix $V$.

Heteroassociations connected in series of the form

$$h(p + \varepsilon) = q$$
$$h(q + \varepsilon) = r$$
$$h(r + \varepsilon) = s$$
$$\vdots$$
$$h(z + \varepsilon) = p$$

can provoke a fast cycle of states

$$p \to q \to r \to s \to \ldots \to z \to p,$$

whereby a single pattern is never completely accepted: Before a
pattern is entirely completed, the heteroassociation already tries
to generate the successor of this pattern. Additionally, the
network would never stop, since after having reached the last
state $z$, it would proceed to the first state $p$ again.

8.5.1 Generating the 

heteroassociative matrix 

We generate the matrix $V$ by means of elements $v$ very similar to
the autoassociative matrix with $p$ being (per transition) the
training sample before the transition and $q$ being the training
sample to be generated from $p$:

$$v_{i,j} = \sum_{p,q \in P,\, p \neq q} p_i q_j \tag{8.4}$$

The diagonal of the matrix is again filled with zeros. The neuron
states are, as always, adapted during operation. Several
transitions can be introduced into the matrix by a simple
addition, whereby the aforementioned limitation exists here, too.


Definition 8.6 (Learning rule for the heteroassociative matrix).
For two training samples $p$ being predecessor and $q$ being
successor of a heteroassociative transition the weights of the
heteroassociative matrix $V$ result from the learning rule

$$v_{i,j} = \sum_{p,q \in P,\, p \neq q} p_i q_j,$$

with several heteroassociations being introduced into the network
by a simple addition.
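
The learning rule of the heteroassociative matrix can be tried out on a single transition. The sketch below makes two assumptions that go beyond the definition: the neuron $k$ sums over the first index, i.e. over $v_{j,k} \cdot x_j$ as in the update scheme (8.1), and all neurons are updated in one synchronous step purely for illustration. The names `hetero_matrix` and `step` and the patterns are hypothetical.

```python
def sign(value):
    """Binary threshold with threshold 0; a sum of exactly 0 maps to 1 (assumption)."""
    return 1 if value >= 0 else -1

def hetero_matrix(transitions, n):
    """v[i][j] accumulates p_i * q_j for each transition p -> q; diagonal stays 0."""
    v = [[0] * n for _ in range(n)]
    for p, q in transitions:
        for i in range(n):
            for j in range(n):
                if i != j:
                    v[i][j] += p[i] * q[j]
    return v

def step(v, state):
    """One synchronous step (simplification): neuron k sums v[j][k] * x_j."""
    n = len(state)
    return [sign(sum(v[j][k] * state[j] for j in range(n))) for k in range(n)]

p = [1, 1, -1, -1]   # hypothetical predecessor pattern
q = [1, -1, 1, -1]   # hypothetical successor pattern

v = hetero_matrix([(p, q)], 4)
print(step(v, p))
```

Starting in the state `p`, a single step produces the successor pattern `q`, and the resulting matrix is in general not symmetric.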


8.5.2 Stabilizing the 

heteroassociations 


We have already mentioned the problem 
that the patterns are not completely gen¬ 
erated but that the next pattern is already 
beginning before the generation of the pre¬ 
vious pattern is finished. 

This problem can be avoided by not only 
influencing the network by means of the 
heteroassociative matrix V but also by 
the already known autoassociative matrix 
W. 

Additionally, the neuron adaptation rule is changed so that
competing terms are generated: One term autoassociating an
existing pattern and one term trying to convert the very same
pattern into its successor. The associative rule provokes that the
network stabilizes a pattern, remains there for a while, goes on
to the next pattern, and so on.

$$x_i(t+1) = f_{\text{act}}\Bigg( \underbrace{\sum_{j \in K} w_{i,j}\, x_j(t)}_{\text{autoassociation}} + \underbrace{\sum_{k \in K} v_{i,k}\, x_k(t - \Delta t)}_{\text{heteroassociation}} \Bigg) \tag{8.5}$$

Here, the value $\Delta t$ causes, descriptively speaking, the
influence of the matrix $V$ to be delayed, since it only refers to
a network being $\Delta t$ versions behind. The result is a change
in state, during which the individual states are stable for a
short while. If $\Delta t$ is set to, for example, twenty steps,
then the asymmetric weight matrix will realize any change in the
network only twenty steps later so that it initially works with
the autoassociative matrix (since it still perceives the
predecessor pattern of the current one), and only after that it
will work against it.

8.5.3 Biological motivation of
heteroassociation

From a biological point of view the transition of stable states
into other stable states is highly motivated: At least in the
beginning of the nineties it was assumed that the Hopfield model
would achieve an approximation of the state dynamics in the brain,
which realizes much by means of state chains: If I asked you, dear
reader, to recite the alphabet, you would generally manage this
better than (please try it immediately) answering the following
question:

Which letter in the alphabet follows the letter P?

Another example is the phenomenon that one cannot remember a
situation, but the place at which one memorized it the last time
is perfectly known. If one returns to this place, the forgotten
situation often comes back to mind.


8.6 Continuous Hopfield 
networks 


So far, we have only discussed Hopfield networks with binary
activations. But Hopfield also described a version of his networks
with continuous activations [Hop84], which we want to cover at
least briefly: continuous Hopfield networks. Here, the activation
is no longer calculated by the binary threshold function but by
the Fermi function with temperature parameters (fig. 8.4).

Here, the network is stable for symmetric weight matrices with
zeros on the diagonal, too.

Hopfield also stated that continuous Hopfield networks can be
applied to find acceptable solutions for the NP-hard travelling
salesman problem [HT85]. According to some verification trials
[Zel94] this statement can no longer be upheld. But today there
are faster algorithms for handling this problem and therefore the
Hopfield network is no longer used here.

Figure 8.4: The already known Fermi function
with different temperature parameter variations.
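
The effect of the temperature parameter can be reproduced with a few lines. The sketch assumes the common form $1 / (1 + e^{-x/T})$ of the Fermi function with temperature parameter $T$; the function name `fermi` and the chosen temperatures are hypothetical.

```python
import math

def fermi(x, temperature=1.0):
    """Fermi (logistic) function with temperature parameter T (assumed form)."""
    return 1.0 / (1.0 + math.exp(-x / temperature))

# The smaller the temperature, the steeper the transition around x = 0.
for temperature in (0.1, 1.0, 5.0):
    values = [round(fermi(x, temperature), 3) for x in (-1.0, 0.0, 1.0)]
    print(temperature, values)
```

For small temperatures the function approaches a threshold function, while large temperatures flatten it.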


Exercises 

Exercise 14. Indicate the storage requirements for a Hopfield
network with |K| = 1000 neurons when the weights $w_{i,j}$
shall be stored as integers. Is it possible
to limit the value range of the weights in
order to save storage space?

Exercise 15. Compute the weights $w_{i,j}$
for a Hopfield network using the training
set


(-1, 1, 1, -1, -1, -1);
(1, -1, -1, 1, -1, 1)}.
Chapter 9

Learning vector quantization 



Learning Vector Quantization is a learning procedure with the aim to represent
the vector training sets divided into predefined classes as well as possible by
using a few representative vectors. If this has been managed, vectors which
were unknown until then could easily be assigned to one of these classes.


Slowly, part II of this text is nearing its end - and therefore I
want to write a last chapter for this part that will be a smooth
transition into the next one: a chapter about learning vector
quantization (abbreviated LVQ) [Koh89] described by Teuvo Kohonen,
which can be characterized as being related to the self-organizing
feature maps. These SOMs are described in the next chapter, which
already belongs to part III of this text, since SOMs learn
unsupervised. Thus, after the exploration of LVQ I want to bid
farewell to supervised learning.


Beforehand, I want to announce that there are different variations
of LVQ, which will be mentioned but not presented in detail. The
goal of this chapter is rather to analyze the underlying principle.


9.1 About quantization 


In order to explore the learning vec¬ 
tor quantization we should at first get 
a clearer picture of what quantization 
(which can also be referred to as dis¬ 
cretization) is. 

Everybody knows the sequence of discrete numbers

$\mathbb{N} = \{1, 2, 3, \ldots\}$,

which contains the natural numbers. Discrete means that this
sequence consists of separated elements that are not
interconnected. The elements of our example are exactly such
numbers, because the natural numbers do not include, for example,
numbers between 1 and 2. On the other hand, the sequence of real
numbers $\mathbb{R}$, for instance, is continuous: it does not
matter how close two selected numbers are, there will always be a
number between them.


discrete 
= separated 
Quantization means that a continuous space is divided into discrete
sections: By deleting, for example, all decimal places of the real
number 2.71828, it could be assigned to the natural number 2. Here
it is obvious that any other number having a 2 in front of the
decimal point would also be assigned to the natural number 2, i.e.
2 would be some kind of representative for all real numbers within
the interval [2, 3).

It must be noted that a sequence can be ir¬ 
regularly quantized, too: For instance, the 
timeline for a week could be quantized into 
working days and weekend. 

A special case of quantization is digitization: In case of
digitization we always talk about regular quantization of a
continuous space into a number system with respect to a certain
base. If we enter, for example, some numbers into the computer,
these numbers will be digitized into the binary system (base 2).

Definition 9.1 (Quantization). Separa¬ 
tion of a continuous space into discrete sec¬ 
tions. 

Definition 9.2 (Digitization). Regular 
quantization. 
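The two kinds of quantization mentioned above can be sketched in a few lines. This is only an illustration: the function names and the numbering of the days (0 = Monday to 6 = Sunday) are assumptions made for this example.

```python
import math


def quantize_regular(x: float) -> int:
    """Regular quantization: every interval [n, n+1) is mapped to n."""
    return math.floor(x)


# Assumption for this sketch: days are numbered 0 (Monday) to 6 (Sunday).
WEEKEND = {5, 6}


def quantize_week(day_index: int) -> str:
    """Irregular quantization: five days fall into one section, two into the other."""
    if not 0 <= day_index <= 6:
        raise ValueError("day_index must be between 0 and 6")
    return "weekend" if day_index in WEEKEND else "working day"


if __name__ == "__main__":
    print(quantize_regular(2.71828))                 # representative of [2, 3)
    print([quantize_week(d) for d in range(7)])
```

For non-negative numbers, `math.floor` corresponds to deleting the decimal places; for negative numbers the two differ, which is why the sketch states the interval form explicitly.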

9.2 LVQ divides the input 
space into separate areas 

Now it is almost possible to describe by means of its name what LVQ
should enable us to do: A set of representatives should be used to
divide an input space into classes that reflect the input space as
well as possible (fig. 9.1 on the facing page). Thus, each element
of the input space should be assigned to a vector as a
representative, i.e. to a class, where the set of these
representatives should represent the entire input space as
precisely as possible. Such a vector is called codebook vector. A
codebook vector is the representative of exactly those input space
vectors lying closest to it, which divides the input space into the
said discrete areas.

It is to be emphasized that we have to know in advance how many
classes we have and which training sample belongs to which class.
Furthermore, it is important that the classes need not be disjoint,
which means they may overlap.

Such separation of data into classes is in¬ 
teresting for many problems for which it 
is useful to explore only some characteris¬ 
tic representatives instead of the possibly 
huge set of all vectors - be it because it is 
less time-consuming or because it is suffi¬ 
ciently precise. 

9.3 Using codebook vectors: 
the nearest one is the 
winner 

The use of a prepared set of codebook vectors is very simple: For
an input vector y the class association is easily decided by
considering which codebook vector is the closest - so the codebook
vectors build a Voronoi diagram out of the set. Since each codebook
vector can clearly be associated with a class, each input vector is
associated with a class, too.

Figure 9.1: Examples for quantization of a two-dimensional input
space. The lines represent the class limits, the x marks represent
the codebook vectors.
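The nearest-vector decision described in this section can be sketched as follows. The names `nearest_codebook` and `classify` as well as the example data are hypothetical, and the Euclidean distance is assumed as the distance measure.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def nearest_codebook(y: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook vector closest to y (Euclidean distance)."""
    distances = [math.dist(y, c) for c in codebook]
    return distances.index(min(distances))


def classify(y: Vector, codebook: Sequence[Vector], labels: Sequence[str]) -> str:
    """The class of y is the class of its closest codebook vector."""
    return labels[nearest_codebook(y, codebook)]


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 1.0)]      # hypothetical codebook vectors
    labels = ["class A", "class B"]          # one class per codebook vector
    print(classify((0.2, 0.1), codebook, labels))
    print(classify((0.9, 0.8), codebook, labels))
```

All inputs that share the same closest codebook vector form one cell of the Voronoi diagram.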

9.4 Adjusting codebook 
vectors 

As we have already indicated, the LVQ is 
a supervised learning procedure. Thus, we 
have a teaching input that tells the learn¬ 
ing procedure whether the classification of 
the input pattern is right or wrong: In 
other words, we have to know in advance 
the number of classes to be represented or 
the number of codebook vectors. 

Roughly speaking, it is the aim of the learning procedure that
training samples are used to cause a previously defined number of
randomly initialized codebook vectors to reflect the training data
as precisely as possible.

9.4.1 The procedure of learning 

Learning works according to a simple scheme. We have (since
learning is supervised) a set P of |P| training samples.
Additionally, we already know that classes are predefined, too,
i.e. we also have a set of classes C. A codebook vector is clearly
assigned to each class. Thus, we can say that the set of classes C
contains |C| codebook vectors $C_1, C_2, \ldots, C_{|C|}$.

This leads to the structure of the training samples: They are of
the form (p, c) and therefore contain the training input vector p
and its class affiliation c. For the class affiliation

$c \in \{1, 2, \ldots, |C|\}$

holds, which means that it clearly assigns the training sample to a
class or a codebook vector.

Intuitively, we could say about learning: 
"Why a learning procedure? We calculate 
the average of all class members and place 
their codebook vectors there - and that’s 
it." But we will see soon that our learning 
procedure can do a lot more. 

I only want to briefly discuss the steps 
of the fundamental LVQ learning proce¬ 
dure: 

Initialization: We place our set of code¬ 
book vectors on random positions in 
the input space. 

Training sample: A training sample p of 
our training set P is selected and pre¬ 
sented. 

Distance measurement: We measure the distance $\|p - C_i\|$
between all codebook vectors $C_1, C_2, \ldots, C_{|C|}$ and our
input p.

Winner: The closest codebook vector wins, i.e. the one with

$$\min_{C_i \in C} \|p - C_i\|.$$

Learning process: The learning process takes place according to
the rule

$$\Delta C_i = \eta(t) \cdot h(p, C_i) \cdot (p - C_i) \tag{9.1}$$

$$C_i(t+1) = C_i(t) + \Delta C_i, \tag{9.2}$$

which we now want to break down.

> We have already seen that the first factor $\eta(t)$ is a
time-dependent learning rate allowing us to differentiate between
large learning steps and fine tuning.

> The last factor $(p - C_i)$ is obviously the direction toward
which the codebook vector is moved.

> But the function $h(p, C_i)$ is the core of the rule: It
implements a distinction of cases.

Assignment is correct: The winner 
vector is the codebook vector of 
the class that includes p. In this 
case, the function provides posi¬ 
tive values and the codebook vec¬ 
tor moves towards p. 

Assignment is wrong: The winner 
vector does not represent the 
class that includes p. Therefore 
it moves away from p. 
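One way to make the distinction of cases concrete is sketched below. It is only an assumption that h takes the value +1 for a correct assignment and -1 for a wrong one (this resembles the simplest LVQ variant); the text itself deliberately leaves h open. The names `lvq_step`, `codebook` and `labels` are hypothetical.

```python
import math
from typing import List, Sequence


def lvq_step(codebook: List[List[float]], labels: Sequence[int],
             p: Sequence[float], c: int, eta: float) -> int:
    """Apply one learning step to the winner codebook vector (in place).

    codebook[i] is the codebook vector C_i, labels[i] the class it represents,
    (p, c) the training sample and eta the current value of the learning rate.
    Returns the index of the winner.
    """
    # Winner: the codebook vector with the smallest distance to p.
    winner = min(range(len(codebook)), key=lambda i: math.dist(p, codebook[i]))
    # Assumed choice of h: +1 if the winner represents class c, otherwise -1.
    h = 1.0 if labels[winner] == c else -1.0
    # Delta C_i = eta * h * (p - C_i), added to the old position.
    codebook[winner] = [w + eta * h * (x - w) for w, x in zip(codebook[winner], p)]
    return winner


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    labels = [0, 1]
    lvq_step(codebook, labels, p=[0.2, 0.0], c=0, eta=0.5)   # correct: moves towards p
    print(codebook)
    lvq_step(codebook, labels, p=[0.2, 0.0], c=1, eta=0.5)   # wrong: moves away from p
    print(codebook)
```

Only the winner is moved in this sketch; other variants of LVQ also adapt further codebook vectors.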

We can see that our definition of the function h was not precise
enough. With good reason: From here on, LVQ is divided into
different nuances, depending on how exactly h and the learning rate
are defined (called LVQ1, LVQ2, LVQ3, OLVQ, etc.). The differences
are, for instance, in the strength of the codebook vector
movements. They are not all based on the same principle described
here, and as announced I don't want to discuss them any further.
Therefore I don't give any formal definition regarding the
aforementioned learning rule and LVQ.


Exercises 

Exercise 16. Indicate a quantization which equally distributes all
vectors $H \in \mathcal{H}$ in the five-dimensional unit cube
$\mathcal{H}$ into one of 1024 classes.


9.5 Connection to neural 
networks 


Until now, in spite of the learning process, the question was what
LVQ has to do with neural networks. The codebook vectors can be
understood as neurons with a fixed position within the input space,
similar to RBF networks. Additionally, in nature it often occurs
that in a group one neuron may fire (a winner neuron, here: a
codebook vector) and, in return, inhibits all other neurons.

I decided to place this brief chapter about 
learning vector quantization here so that 
this approach can be continued in the fol¬ 
lowing chapter about self-organizing maps: 
We will classify further inputs by means of 
neurons distributed throughout the input 
space, only that this time, we do not know 
which input belongs to which class. 

Now let us take a look at the unsupervised learning networks!

Unsupervised learning network paradigms

Chapter 10

Self-organizing feature maps 



A paradigm of unsupervised learning neural networks, which maps an input
space by its fixed topology and thus independently looks for similarities.

Function, learning procedure, variations and neural gas. 


How are 
data stored 
in the 
brain? 


If you take a look at the concepts of biological neural networks
mentioned in the introduction, one question will arise: How does
our brain store and recall the impressions it receives every day?
Let me point out that the brain does not have any training samples
and therefore no "desired output". And while already considering
this subject we realize that there is no output in this sense at
all, either. Our brain responds to external input by changes in
state. These are, so to speak, its output.


Based on this principle and exploring the question of how
biological neural networks organize themselves, Teuvo Kohonen
developed in the eighties his self-organizing feature maps
[Koh82, Koh98], referred to in short as self-organizing maps or
SOMs: a paradigm of neural networks where the output is the state
of the network, which learns completely unsupervised, i.e. without
a teacher.


Unlike the other network paradigms we have already got to know, for
SOMs it is unnecessary to ask what the neurons calculate. We only
ask which neuron is active at the moment. Biologically, this is
well motivated: If in biology the neurons are connected to certain
muscles, it will be less interesting to know how strongly a certain
muscle is contracted than which muscle is activated. In other
words: We are not interested in the exact output of the neuron but
in knowing which neuron provides output. Thus, SOMs are
considerably more related to biology than, for example, the
feedforward networks, which are increasingly used for calculations.


10.1 Structure of a 

self-organizing map 


Typically, SOMs have - like our brain - the task to map a
high-dimensional input (N dimensions) onto areas in a
low-dimensional grid of cells (G dimensions) to draw a map of the
high-dimensional space, so to speak. To generate this map, the SOM
simply obtains arbitrarily many points of the input space. During
the input of the points the SOM will try to cover as well as
possible the positions on which the points appear by its neurons.
This particularly means that every neuron can be assigned to a
certain position in the input space.

At first, these facts seem to be a bit confusing, and it is
recommended to briefly reflect on them. There are two spaces in
which SOMs are working:

> The N-dimensional input space and

> the G-dimensional grid on which the neurons are lying and which
indicates the neighborhood relationships between the neurons and
therefore the network topology.

Figure 10.1: Example topologies of a self-organizing map. Above we
can see a one-dimensional topology, below a two-dimensional one.


In a one-dimensional grid, the neurons could be, for instance, like
pearls on a string. Every neuron would have exactly two neighbors
(except for the two end neurons). A two-dimensional grid could be a
square array of neurons (fig. 10.1). Another possible array in
two-dimensional space would be some kind of honeycomb shape.
Irregular topologies are possible, too, but not very common.
Topologies with more dimensions and considerably more neighborhood
relationships would also be possible, but because they are hard to
visualize they are not employed very often.


Even if N = G is true, the two spaces are 
not equal and have to be distinguished. In 
this special case they only have the same 
dimension. 

Initially, we will briefly and formally re¬ 
gard the functionality of a self-organizing 
map and then make it clear by means of 
some examples. 

Definition 10.1 (SOM neuron). Similar to the neurons in an RBF
network, a SOM neuron k occupies a fixed position $c_k$ (a center)
in the input space.

Definition 10.2 (Self-organizing map). A self-organizing map is a
set K of SOM neurons. If an input vector is entered, exactly that
neuron $k \in K$ is activated which is closest to the input pattern
in the input space. The dimension of the input space is referred to
as N.

Definition 10.3 (Topology). The neurons are interconnected by
neighborhood relationships. These neighborhood relationships are
called topology. The training of a SOM is highly influenced by the
topology. It is defined by the topology function h(i, k, t), where
i is the winner neuron (see footnote 1), k the neuron to be adapted
(which will be discussed later) and t the timestep. The dimension
of the topology is referred to as G.

10.2 SOMs always activate 
the neuron with the 
least distance to an 
input pattern 

Like many other neural networks, the 
SOM has to be trained before it can be 
used. But let us regard the very simple 
functionality of a complete self-organizing 
map before training, since there are many 
analogies to the training. Functionality 
consists of the following steps: 

Input of an arbitrary value p of the input space $\mathbb{R}^N$.

Calculation of the distance between every neuron k and p by means
of a norm, i.e. calculation of $\|p - c_k\|$.

One neuron becomes active, namely the neuron i with the shortest
calculated distance to the input. All other neurons remain
inactive. This paradigm of activity is also called the
winner-takes-all scheme. The output we expect due to the input of a
SOM shows which neuron becomes active.

1 We will learn soon what a winner neuron is.
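The three steps above can be sketched with an exchangeable norm. The function names and the two example norms are assumptions for illustration; the text only requires "a norm".

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.dist(a, b)


def manhattan(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def winner(p: Vector, centers: Sequence[Vector],
           norm: Callable[[Vector, Vector], float] = euclidean) -> int:
    """Winner-takes-all: index of the neuron whose center is closest to p.

    If several neurons share the smallest distance, the first one is chosen.
    """
    distances = [norm(p, c_k) for c_k in centers]
    return distances.index(min(distances))


if __name__ == "__main__":
    centers = [(3.0, 0.0), (2.0, 2.0)]   # hypothetical neuron centers c_k
    p = (0.0, 0.0)
    print(winner(p, centers, euclidean))
    print(winner(p, centers, manhattan))
```

In this example the winner depends on the chosen norm, which is why the norm should be fixed once for a given map.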

In much of the literature, the description of SOMs is more formal:
Often an input layer is described that is completely linked towards
a SOM layer. Then the input layer (N neurons) forwards all inputs
to the SOM layer. The SOM layer is laterally linked in itself so
that a winner neuron can be established and inhibit the other
neurons. I think that this explanation of a SOM is not very
descriptive and therefore I tried to provide a clearer description
of the network structure.

Now the question is which neuron is ac¬ 
tivated by which input - and the answer 
is given by the network itself during train¬ 
ing. 

10.3 Training 

[Training makes the SOM topology cover 
the input space] The training of a SOM 
is nearly as straightforward as the func¬ 
tionality described above. Basically, it is 
structured into five steps, which partially 
correspond to those of functionality. 

Initialization: The network starts with random neuron centers
$c_k \in \mathbb{R}^N$ from the input space.

Creating an input pattern: A stimulus, i.e. a point p, is selected
from the input space $\mathbb{R}^N$. Now this stimulus is entered
into the network.

Distance measurement: Then the distance $\|p - c_k\|$ is determined
for every neuron k in the network.

Winner takes all: The winner neuron i is determined, which has the
smallest distance to p, i.e. which fulfills the condition

$$\|p - c_i\| \le \|p - c_k\| \quad \forall k \ne i.$$

You can see that from several winner neurons one can be selected at
will.

Adapting the centers: The neuron centers are moved within the input
space according to the rule (see footnote 2)

$$\Delta c_k = \eta(t) \cdot h(i, k, t) \cdot (p - c_k),$$

where the values $\Delta c_k$ are simply added to the existing
centers. The last factor shows that the change in position of the
neurons k is proportional to the distance to the input pattern p
and, as usual, to a time-dependent learning rate $\eta(t)$. The
above-mentioned network topology exerts its influence by means of
the function h(i, k, t), which will be discussed in the following.

2 Note: In many sources this rule is written $\eta h (p - c_k)$,
which wrongly leads the reader to believe that h is a constant.
This problem can easily be solved by not omitting the
multiplication dots $\cdot$.


Definition 10.4 (SOM learning rule). A 
SOM is trained by presenting an input pat¬ 
tern and determining the associated win¬ 
ner neuron. The winner neuron and its 
neighbor neurons, which are defined by the 
topology function, then adapt their cen¬ 
ters according to the rule 

$$\Delta c_k = \eta(t) \cdot h(i, k, t) \cdot (p - c_k), \tag{10.1}$$

$$c_k(t+1) = c_k(t) + \Delta c_k(t). \tag{10.2}$$
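A single training step following this rule might look as follows. This is a sketch under assumptions: the learning rate and the topology function are passed in as ordinary Python callables `eta(t)` and `h(i, k, t)`, and the Euclidean norm is used to find the winner; none of these names come from the text.

```python
import math
from typing import Callable, List, Sequence


def som_step(centers: List[List[float]], p: Sequence[float],
             eta: Callable[[int], float],
             h: Callable[[int, int, int], float], t: int) -> int:
    """One SOM training step at time t; centers are changed in place.

    Returns the index i of the winner neuron.
    """
    # Winner neuron i: smallest distance between its center and p.
    i = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
    # Every neuron k is moved by eta(t) * h(i, k, t) * (p - c_k).
    for k, c_k in enumerate(centers):
        factor = eta(t) * h(i, k, t)
        centers[k] = [c + factor * (x - c) for c, x in zip(c_k, p)]
    return i


if __name__ == "__main__":
    centers = [[0.0], [1.0], [2.0]]
    only_winner = lambda i, k, t: 1.0 if i == k else 0.0   # simplest possible h
    constant_eta = lambda t: 0.5
    print(som_step(centers, [1.2], constant_eta, only_winner, t=0), centers)
```

Because h is evaluated for every pair (i, k), neurons with h = 0 simply stay where they are.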

10.3.1 The topology function
defines how a learning
neuron influences its
neighbors

The topology function h is not defined 
on the input space but on the grid and rep¬ 
resents the neighborhood relationships be¬ 
tween the neurons, i.e. the topology of the 
network. It can be time-dependent (which 
it often is) — which explains the parameter 
t. The parameter k is the index running 
through all neurons, and the parameter i 
is the index of the winner neuron. 

In principle, the function shall take a large value if k is the
neighbor of the winner neuron or even the winner neuron itself, and
small values if not. More precise definition: The topology function
must be unimodal, i.e. it must have exactly one maximum. This
maximum must be at the winner neuron i, for which the distance to
itself certainly is 0.

Additionally, the time-dependence enables 
us, for example, to reduce the neighbor¬ 
hood in the course of time. 


defined on 
the grid 


only 1 maximum 
for the winner 


In order to be able to output large values 
for the neighbors of i and small values for 
non-neighbors, the function h needs some 
kind of distance notion on the grid because 
from somewhere it has to know how far i 
and k are apart from each other on the 
grid. There are different methods to cal¬ 
culate this distance. 


On a two-dimensional grid we could apply, for instance, the
Euclidean distance (lower part of fig. 10.2) or on a
one-dimensional grid we could simply use the number of the
connections between the neurons i and k (upper part of the same
figure).

Definition 10.5 (Topology function). The topology function
h(i, k, t) describes the neighborhood relationships in the
topology. It can be any unimodal function that reaches its maximum
when i = k. Time-dependence is optional, but often used.

10.3.1.1 Introduction of common 
distance and topology 
functions 


A common distance function would be, for example, the already known
Gaussian bell (see fig. 10.3 on page 153). It is unimodal with a
maximum close to 0. Additionally, its width can be changed by
applying its parameter $\sigma$, which can be used to realize the
neighborhood being reduced in the course of time: We simply relate
the time-dependence to the $\sigma$ and the result is a
monotonically decreasing $\sigma(t)$. Then our topology function
could look like this:

$$h(i, k, t) = e^{-\frac{\|g_i - g_k\|^2}{2 \cdot \sigma(t)^2}}, \tag{10.3}$$

where $g_i$ and $g_k$ represent the neuron positions on the grid,
not the neuron positions in the input space, which would be
referred to as $c_i$ and $c_k$.

Figure 10.2: Example distances of a one-dimensional SOM topology
(above) and a two-dimensional SOM topology (below) between two
neurons i and k. In the lower case the Euclidean distance is
determined (in two-dimensional space equivalent to the Pythagorean
theorem). In the upper case we simply count the discrete path
length between i and k. To simplify matters I required a fixed grid
edge length of 1 in both cases.
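A Gaussian topology function with a shrinking width could be sketched like this. The exponential shape of the schedule `sigma(t)` is an assumption chosen for illustration (any monotonically decreasing function would do); the start and end values 10.0 and 0.2 and the 80000 steps merely echo the values quoted in the caption of figure 10.5.

```python
import math
from typing import Sequence


def sigma(t: int, sigma_start: float = 10.0, sigma_end: float = 0.2,
          t_max: int = 80000) -> float:
    """Monotonically decreasing width; assumed exponential interpolation."""
    frac = min(max(t / t_max, 0.0), 1.0)
    return sigma_start * (sigma_end / sigma_start) ** frac


def gaussian_topology(g_i: Sequence[float], g_k: Sequence[float], t: int) -> float:
    """Gaussian bell over the squared grid distance between neurons i and k.

    g_i and g_k are grid positions, not positions in the input space.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(g_i, g_k))
    return math.exp(-squared_distance / (2.0 * sigma(t) ** 2))


if __name__ == "__main__":
    for t in (0, 40000, 80000):
        print(t, round(sigma(t), 3), gaussian_topology((0, 0), (1, 0), t))
```

Early in training even distant grid neighbors receive a value close to 1, while at the end practically only the winner itself is adapted.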


Other functions that can be used instead of the Gaussian function
are, for instance, the cone function, the cylinder function or the
Mexican hat function (fig. 10.3 on the facing page). Here, the
Mexican hat function offers a particular biological motivation: Due
to its negative values it rejects some neurons close to the winner
neuron, a behavior that has already been observed in nature. This
can cause sharply separated map areas - and that is exactly why the
Mexican hat function has been suggested by Teuvo Kohonen himself.
But this adjustment characteristic is not necessary for the
functionality of the map; it could even be possible that the map
would diverge, i.e. it could virtually explode.


10.3.2 Learning rates and
neighborhoods can decrease
monotonically over time

To avoid that the later training phases forcefully pull the entire
map towards a new pattern, the SOMs often work with temporally
monotonically decreasing learning rates and neighborhood sizes. At
first, let us talk about the learning rate: Typical sizes of the
target value of a learning rate are two orders of magnitude smaller
than the initial value, e.g.

$$0.01 < \eta < 0.6$$

could be true. But this size must also depend on the network
topology or the size of the neighborhood.

As we have already seen, a decreasing neighborhood size can be
realized, for example, by means of a time-dependent, monotonically
decreasing $\sigma$ with the Gaussian bell being used in the
topology function.

The advantage of a decreasing neighborhood size is that in the
beginning a moving neuron "pulls along" many neurons in its
vicinity, i.e. the randomly initialized network can unfold fast and
properly in the beginning. At the end of the learning process, only
a few neurons are influenced at the same time, which stiffens the
network as a whole but enables a good "fine tuning" of the
individual neurons.

It must be noted that

$$h \cdot \eta \le 1$$

must always be true, since otherwise the neurons would constantly
miss the current training sample.

But enough of theory - let us take a look at a SOM in action!
Figure 10.3: Gaussian bell, cone function, cylinder function and
the Mexican hat function suggested by Kohonen as examples for
topology functions of a SOM.

Figure 10.4: Illustration of the two-dimensional input space (left)
and the one-dimensional topology space (right) of a self-organizing
map. Neuron 3 is the winner neuron since it is closest to p. In the
topology, the neurons 2 and 4 are the neighbors of 3. The arrows
mark the movement of the winner neuron and its neighbors towards
the training sample p. To illustrate the one-dimensional topology
of the network, it is plotted into the input space by the dotted
line.


10.4 Examples for the 

functionality of SOMs 

Let us begin with a simple, mentally comprehensible example.

In this example, we use a two-dimensional input space, i.e. N = 2
is true. Let the grid structure be one-dimensional (G = 1).
Furthermore, our example SOM should consist of 7 neurons and the
learning rate should be $\eta = 0.5$.

The neighborhood function is also kept simple so that we will be
able to mentally comprehend the network:

$$h(i, k, t) = \begin{cases} 1 & k \text{ direct neighbor of } i, \\ 1 & k = i, \\ 0 & \text{otherwise.} \end{cases} \tag{10.4}$$

Now let us take a look at the above-mentioned network with random
initialization of the centers (fig. 10.4 on the preceding page) and
enter a training sample p. Obviously, in our example the input
pattern is closest to neuron 3, i.e. this is the winning neuron.

We remember the learning rule for SOMs

$$\Delta c_k = \eta(t) \cdot h(i, k, t) \cdot (p - c_k)$$

and process the three factors from the back:

Learning direction: Remember that the neuron centers $c_k$ are
vectors in the input space, as well as the pattern p. Thus, the
factor $(p - c_k)$ indicates the vector of the neuron k to the
pattern p. This is now multiplied by different scalars:

Our topology function h indicates that only the winner neuron and
its two closest neighbors (here: 2 and 4) are allowed to learn by
returning 0 for all other neurons. A time-dependence is not
specified. Thus, our vector $(p - c_k)$ is multiplied by either 1
or 0.

The learning rate indicates, as always, the strength of learning.
As already mentioned, $\eta = 0.5$, i.e. all in all, the result is
that the winner neuron and its neighbors (here: 2, 3 and 4)
approximate the pattern p half the way (in the figure marked by
arrows).

Although the center of neuron 7 - seen from the input space - is
considerably closer to the input pattern p than neuron 2, neuron 2
is learning and neuron 7 is not. I want to remind you that the
network topology specifies which neuron is allowed to learn and not
its position in the input space. This is exactly the mechanism by
which a topology can significantly cover an input space without
having to be related to it in any way.

After the adaptation of the neurons 2, 3 and 4 the next pattern is
applied, and so on. Another example of how such a one-dimensional
SOM can develop in a two-dimensional input space with uniformly
distributed input patterns in the course of time can be seen in
figure 10.5 on the facing page.

End states of one- and two-dimensional SOMs with differently shaped
input spaces can be seen in figure 10.6 on page 158. As we can see,
not every input space can be neatly covered by every network
topology. There are so-called exposed neurons - neurons which are
located in an area where no input pattern has ever occurred. A
one-dimensional topology generally produces fewer exposed neurons
than a two-dimensional one: For instance, during training on
circularly arranged input patterns it is nearly impossible with a
two-dimensional squared topology to avoid the exposed neurons in
the center of the circle. These are pulled in every direction
during the training so that they finally remain in the center. But
this does not make the one-dimensional topology an optimal topology
since it can only find less complex neighborhood relationships than
a multi-dimensional one.
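Returning to the seven-neuron example from the beginning of this section, one learning step can be reproduced in a few lines. The coordinates of the centers and of p are hypothetical (figure 10.4 gives no numbers); they are merely chosen so that neuron 3 wins and neuron 7 lies closer to p than neuron 2. List index k corresponds to neuron number k + 1.

```python
import math

ETA = 0.5  # learning rate of the example


def h(i: int, k: int) -> float:
    """Neighborhood function of the example: 1 for the winner and its direct
    neighbors on the one-dimensional grid, 0 otherwise (no time-dependence)."""
    return 1.0 if abs(i - k) <= 1 else 0.0


def train_step(centers, p):
    """Return the winner index and the list of adapted centers."""
    i = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
    updated = [tuple(c + ETA * h(i, k) * (x - c) for c, x in zip(c_k, p))
               for k, c_k in enumerate(centers)]
    return i, updated


# Hypothetical centers of neurons 1..7 in the two-dimensional input space.
centers = [(0.0, 0.0), (0.0, 0.5), (0.4, 0.6), (0.9, 0.9),
           (1.0, 0.3), (0.9, 0.0), (0.6, 0.2)]
p = (0.5, 0.5)

if __name__ == "__main__":
    winner, updated = train_step(centers, p)
    print("winner neuron:", winner + 1)
    for number, (old, new) in enumerate(zip(centers, updated), start=1):
        print(number, old, "->", new)
```

Neurons 2, 3 and 4 move half the way towards p, while neuron 7 stays in place although it is closer to p than neuron 2.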


10.4.1 Topological defects are
failures in SOM unfolding

During the unfolding of a SOM it could happen that a topological
defect (fig. 10.7) occurs, i.e. the SOM does not unfold correctly.
A topological defect can be described at best by means of the word
"knotting".

A remedy for topological defects could be to increase the initial
values for the neighborhood size, because the more complex the
topology is (or the more neighbors each neuron has, respectively,
since a three-dimensional or a honeycombed two-dimensional topology
could also be generated) the more difficult it is for a randomly
initialized map to unfold.

Figure 10.7: A topological defect in a two-dimensional SOM.

Figure 10.5: Behavior of a SOM with one-dimensional topology
(G = 1) after the input of 0, 100, 300, 500, 5000, 50000, 70000 and
80000 randomly distributed input patterns $p \in \mathbb{R}^2$.
During the training $\eta$ decreased from 1.0 to 0.1, the $\sigma$
parameter of the Gauss function decreased from 10.0 to 0.2.

Figure 10.6: End states of one-dimensional (left column) and
two-dimensional (right column) SOMs on different input spaces. 200
neurons were used for the one-dimensional topology, 10 x 10 neurons
for the two-dimensional topology and 80,000 input patterns for all
maps.


10.5 It is possible to adjust
the resolution of certain
areas in a SOM

We have seen that a SOM is trained by entering input patterns of
the input space one after another, again and again so that the SOM
will be aligned with these patterns and map them. It could happen
that we want a certain subset U of the input space to be mapped
more precisely than the other ones.

This problem can easily be solved by means of SOMs: During the
training disproportionately many input patterns of the area U are
presented to the SOM. If the number of training patterns of
$U \subset \mathbb{R}^N$ presented to the SOM exceeds the number of
those patterns of the remaining $\mathbb{R}^N \setminus U$, then
more neurons will group there while the remaining neurons are
sparsely distributed on $\mathbb{R}^N \setminus U$ (fig. 10.8 on
the next page).

As you can see in the illustration, the edge of the SOM could be
deformed. This can be compensated by assigning to the edge of the
input space a slightly higher probability of being hit by training
patterns (an often applied approach for reaching every corner with
the SOMs).

Also, a higher learning rate is often used for edge and corner
neurons, since they are only pulled into the center by the
topology. This also results in a significantly improved corner
coverage.
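Presenting disproportionately many patterns from an area U only requires a biased pattern generator. The following sketch assumes a unit square as input space and a central disc of radius 0.25 as U; the function name, the parameter `extra_prob` and all numbers are illustrative assumptions and not the setup used for figure 10.8.

```python
import math
import random


def sample_biased(rng: random.Random, extra_prob: float = 0.5):
    """Draw one training pattern from the unit square.

    With probability extra_prob the pattern is drawn uniformly from the
    central disc U (radius 0.25 around (0.5, 0.5)); otherwise it is drawn
    uniformly from the whole square, which includes U as well.
    """
    if rng.random() < extra_prob:
        # The square root makes the point uniform over the disc area.
        r = 0.25 * math.sqrt(rng.random())
        phi = 2.0 * math.pi * rng.random()
        return (0.5 + r * math.cos(phi), 0.5 + r * math.sin(phi))
    return (rng.random(), rng.random())


def in_u(point) -> bool:
    """True if the point lies in the central disc U (small tolerance for rounding)."""
    return math.dist(point, (0.5, 0.5)) <= 0.25 + 1e-12


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_biased(rng) for _ in range(2000)]
    share = sum(in_u(s) for s in samples) / len(samples)
    print("share of patterns inside U:", share)
```

The disc covers only about a fifth of the square, so a clearly larger share of patterns inside U indicates the intended bias; these patterns would then be fed to the usual training step.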


10.6 Application of SOMs 


Regarding the biologically inspired asso¬ 
ciative data storage, there are many 
fields of application for self-organizing 
maps and their variations. 


For example, the different phonemes of the Finnish language have
successfully been mapped onto a SOM with a two-dimensional discrete grid
topology, and neighborhoods have thereby been found (a SOM does nothing
else than finding neighborhood relationships). So one tries once more to
break down a high-dimensional space into a low-dimensional space (the
topology), looks whether some structures have developed - et voilà:
clearly defined areas for the individual phonemes are formed.

Teuvo Kohonen himself made the effort to search many papers mentioning his
SOMs in their keywords. In this large input space the individual papers now
occupy individual positions, depending on the occurrence of keywords. Then
Kohonen created a SOM with G = 2 and used it to map the high-dimensional
"paper space" developed by him.

Thus, it is possible to enter any paper into the completely trained SOM and
look which neuron in the SOM is activated. It is likely that the neighboring
papers in the topology turn out to be interesting, too. This type of
brain-like context-based search also works with many other input spaces.

It is to be noted that the system itself defines what is neighboring, i.e.
similar, within the topology - and that's why it is so interesting.

Figure 10.8: Training of a SOM with G = 2 on a two-dimensional input space. On the left side,
the chance to become a training pattern was equal for each coordinate of the input space. On the
right side, for the central circle in the input space, this chance is more than ten times larger than
for the remaining input space (visible in the larger pattern density in the background). In this circle
the neurons are obviously more crowded and the remaining area is covered less densely, but in both
cases the neurons are still evenly distributed. The two SOMs were trained by means of 80,000
training samples and decreasing η (1 → 0.2) as well as decreasing σ (5 → 0.5).

This example shows that the position c of the neurons in the input space is
not significant. It is rather interesting to see which neuron is activated
when an unknown input pattern is entered. Next, we can look for which of the
previous inputs this neuron was also activated - and will immediately
discover a group of very similar inputs. The further apart the inputs lie
within the topology, the less they have in common. Virtually, the topology
generates a map of the input characteristics - reduced to descriptively few
dimensions in relation to the input dimension.

Therefore, the topology of a SOM often 
is two-dimensional so that it can be easily 
visualized, while the input space can be 
very high-dimensional. 
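The context-based search described above can be sketched in a few lines. The sketch assumes an already trained SOM whose centers are simply given as a list; the center values, the paper names and the helper names `winner` and `build_index` are made up for illustration.

```python
import math
from collections import defaultdict


def winner(centers, x):
    """Index of the neuron whose center is closest to input x."""
    return min(range(len(centers)), key=lambda k: math.dist(centers[k], x))


def build_index(centers, inputs):
    """Map each neuron index to the names of the inputs that activated it."""
    index = defaultdict(list)
    for name, x in inputs.items():
        index[winner(centers, x)].append(name)
    return index


# Hypothetical, already trained centers of a tiny SOM (values are made up).
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
known = {"paper A": (0.1, 0.1), "paper B": (0.9, 0.2), "paper C": (0.2, 0.0)}

index = build_index(centers, known)
# Enter an unknown input and look up which known inputs share its neuron.
similar = index[winner(centers, (0.05, 0.15))]
```

Here `similar` contains the papers A and C, because both activated the same neuron as the new input. Extending the lookup to the topological neighbors of the winner would additionally need the grid coordinates of the neurons, which this sketch leaves out.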


10.6.1 SOMs can be used to
determine centers for RBF
neurons

SOMs arrange themselves exactly towards the positions of the inputs
presented to them. As a result they are used, for example, to select the
centers of an RBF network. We have already been introduced to the paradigm
of the RBF network in chapter 6.

As we have already seen, it is possible to control which areas of the input
space should be covered with higher resolution - or, in connection with RBF
networks, on which areas of our function the RBF network should work with
more neurons, i.e. work more exactly. As a further useful feature of the
combination of RBF networks with SOMs, one can use the topology obtained
through the SOM: During the final training of an RBF neuron it can be used
to influence neighboring RBF neurons in different ways.

For this, many neural network simulators 
offer an additional so-called SOM layer 
in connection with the simulation of RBF 
networks. 


10.7 Variations of SOMs 

There are different variations of SOMs 
for different variations of representation 
tasks: 


10.7.1 A neural gas is a SOM
without a static topology

The neural gas is a variation of the self-organizing maps by Thomas
Martinetz [MBS93], which was developed out of the difficulty of mapping
complex input information that partially occurs only in subspaces of the
input space or even changes subspaces (fig. 10.9 on the following page).


Figure 10.9: A figure filling different subspaces of the actual input space at different positions,
which therefore can hardly be filled by a SOM.

The idea of a neural gas is, roughly speaking, to realize a SOM without a
grid structure. Since neural gases are derived from the SOMs, the learning
steps are very similar to the SOM learning steps, but they include an
additional intermediate step:

▷ again, random initialization of c_k ∈ ℝ^n

▷ selection and presentation of a pattern of the input space p ∈ ℝ^n

▷ neuron distance measurement

▷ identification of the winner neuron i

▷ Intermediate step: generation of a list L of neurons sorted in ascending
  order by their distance to the winner neuron. Thus, the first neuron in
  the list L is the neuron that is closest to the winner neuron.

▷ changing the centers by means of the known rule but with the slightly
  modified topology function h_L(i, k, t).

The function h_L(i, k, t), which is slightly modified compared with the
original function h(i, k, t), now regards the first elements of the list as
the neighborhood of the winner neuron i. The direct result is that - similar
to the free-floating molecules in a gas - the neighborhood relationships
between the neurons can change anytime, and the number of neighbors is
almost arbitrary, too. The distance within the neighborhood is now
represented by the distance within the input space.

The bulk of neurons can become as stiffened as a SOM by means of a
constantly decreasing neighborhood size. It does not have a fixed dimension
but it can take the dimension that is locally needed at the moment, which
can be very advantageous.

A disadvantage could be that there is no fixed grid forcing the input space
to become regularly covered, and therefore holes can occur in the cover or
neurons can be isolated.
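The learning steps of the neural gas can be sketched as follows. This is a sketch, not a reference implementation: the text only says that h_L regards the first elements of the list L as the neighborhood, so the Gaussian over the list position used here, the exponentially decreasing parameters and the two input regions are assumptions chosen for illustration.

```python
import math
import random


def neural_gas_step(centers, p, eta, sigma):
    """One learning step of the neural gas as outlined in the list above.

    centers: list of center vectors (lists of floats), changed in place.
    p: input pattern, eta: learning rate, sigma: neighborhood size.
    The rank-based Gaussian used for h_L is an assumption of this sketch.
    """
    # Winner neuron i: the center closest to the presented pattern p.
    i = min(range(len(centers)), key=lambda k: math.dist(centers[k], p))
    # Intermediate step: list L, sorted by distance to the winner neuron.
    L = sorted(range(len(centers)),
               key=lambda k: math.dist(centers[k], centers[i]))
    for rank, k in enumerate(L):  # rank 0 is the winner itself
        h = math.exp(-(rank ** 2) / (2.0 * sigma ** 2))
        centers[k] = [c + eta * h * (x - c) for c, x in zip(centers[k], p)]
    return i


rng = random.Random(1)
centers = [[rng.random(), rng.random()] for _ in range(20)]
steps = 2000
for t in range(steps):
    frac = t / steps
    eta = 1.0 * (0.1 / 1.0) ** frac        # decreases from 1.0 towards 0.1
    sigma = 10.0 * (0.2 / 10.0) ** frac    # decreases from 10.0 towards 0.2
    # Two disjoint input regions, one of them only one-dimensional.
    if rng.random() < 0.5:
        p = [rng.uniform(0.0, 0.3), rng.uniform(0.0, 0.3)]
    else:
        p = [rng.uniform(0.6, 1.0), 0.8]
    neural_gas_step(centers, p, eta, sigma)
```

With the large initial `sigma`, the neighborhood at first refers to practically all 20 neurons and only later shrinks to the closest ones.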


In spite of all practical hints, it is as al¬ 
ways the user’s responsibility not to un¬ 
derstand this text as a catalog for easy an¬ 
swers but to explore all advantages and 
disadvantages himself. 

Unlike a SOM, the neighborhood of a neural gas must initially refer to all
neurons since otherwise some outliers of the random initialization may never
reach the remaining group. Forgetting this is a common error during the
implementation of a neural gas.

With a neural gas it is possible to learn a kind of complex input such as in
fig. 10.9 on the preceding page since we are not bound to a
fixed-dimensional grid. But some computational effort could be necessary for
the permanent sorting of the list (here, it could be effective to store the
list in an ordered data structure right from the start).


Definition 10.6 (Neural gas). A neural gas differs from a SOM by a
completely dynamic neighborhood function. With every learning cycle it is
decided anew which neurons are the neighborhood neurons of the winner
neuron. Generally, the criterion for this decision is the distance between
the neurons and the winner neuron in the input space.


10.7.2 A Multi-SOM consists of
several separate SOMs

In order to present another variant of the SOMs, I want to formulate an
extended problem: What do we do with input patterns of which we know that
they are confined to different (maybe disjoint) areas?

Here, the idea is to use not only one SOM but several ones: a
multi-self-organizing map, shortly referred to as M-SOM [GKE01b, GKE01a,
GS06]. It is unnecessary that the SOMs have the same topology or size; an
M-SOM is just a combination of M SOMs.

This learning process is analogous to that of the SOMs. However, only the
neurons belonging to the winner SOM of each training step are adapted. Thus,
it is easy to represent two disjoint clusters of data by means of two SOMs,
even if one of the clusters is not represented in every dimension of the
input space ℝ^n. Actually, the individual SOMs exactly reflect these
clusters.

Definition 10.7 (Multi-SOM). A multi-SOM is nothing more than the
simultaneous use of M SOMs.


10.7.3 A multi-neural gas consists
of several separate neural
gases

Analogous to the multi-SOM, we also have a set of M neural gases: a
multi-neural gas [GS06, SG06]. This construct behaves analogously to neural
gas and M-SOM: Again, only the neurons of the winner gas are adapted.

The reader certainly wonders what advantage there is in using a multi-neural
gas, since an individual neural gas is already capable of dividing into
clusters and of working on complex input patterns with changing dimensions.
Basically, this is correct, but a multi-neural gas has two serious
advantages over a simple neural gas.

1. With several gases, we can directly tell which neuron belongs to which
   gas. This is particularly important for clustering tasks, for which
   multi-neural gases have been used recently. Simple neural gases can also
   find and cover clusters, but then we cannot recognize which neuron
   belongs to which cluster.

2. A lot of computational effort is saved when large original gases are
   divided into several smaller ones since (as already mentioned) the
   sorting of the list L could use a lot of computational effort, while the
   sorting of several smaller lists L_1, L_2, ..., L_M is less
   time-consuming - even if these lists in total contain the same number of
   neurons.

As a result we will only obtain local in¬ 
stead of global sortings, but in most cases 
these local sortings are sufficient. 
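A rough estimate may make the second advantage more tangible. It rests on two assumptions of mine that are not part of the text: the list is sorted by a comparison-based procedure whose cost grows like n log n, and the N neurons are divided evenly among the M gases.

```latex
\underbrace{N \log_2 N}_{\text{one list of } N \text{ neurons}}
\quad\text{versus}\quad
\underbrace{M \cdot \frac{N}{M} \log_2 \frac{N}{M}}_{M \text{ lists of } N/M \text{ neurons}}
\;=\; N \log_2 N - N \log_2 M
```

For N = 1024 and M = 8 this gives 1024 · 10 = 10240 versus 1024 · 7 = 7168 units. If, in addition, only the list of the winner gas has to be sorted in a training step, the effort per step drops to 128 · 7 = 896 units.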

Now we can choose between two extreme cases of multi-neural gases: One
extreme case is the ordinary neural gas M = 1, i.e. we only use one single
neural gas. Interestingly enough, the other extreme case (very large M, a
few or only one neuron per gas) behaves analogously to the k-means
clustering (for more information on clustering procedures see excursus A).


Definition 10.8 (Multi-neural gas). A multi-neural gas is nothing more than
the simultaneous use of M neural gases.

10.7.4 Growing neural gases can 
add neurons to themselves 

A growing neural gas is a variation of the aforementioned neural gas to
which more and more neurons are added according to certain rules. Thus, this
is an attempt to work against the isolation of neurons or the generation of
larger holes in the cover.

Here, this subject should only be men¬ 
tioned but not discussed. 

To build a growing SOM is more difficult 
because new neurons have to be integrated 
in the neighborhood. 

Exercises 

Exercise 17. A regular, two-dimensional 
grid shall cover a two-dimensional surface 
as "well" as possible. 

1. Which grid structure would suit best 
for this purpose? 

2. Which criteria did you use for "well" 
and "best"? 

The very imprecise formulation of this exercise is intentional.

Chapter 11

Adaptive resonance theory


An ART network in its original form shall classify binary input vectors, i.e.
assign them to a 1-out-of-n output. Simultaneously, the so far unclassified
patterns shall be recognized and assigned to a new class.


As in the other smaller chapters, we want 
to try to figure out the basic idea of 
the adaptive resonance theory (abbre¬ 
viated: ART) without discussing its the¬ 
ory profoundly. 

In several sections we have already men¬ 
tioned that it is difficult to use neural 
networks for the learning of new informa¬ 
tion in addition to but without destroying 
the already existing information. This cir¬ 
cumstance is called stability / plasticity 
dilemma. 


In 1987, Stephen Grossberg and Gail Carpenter published the first version of
their ART network [Gro76] in order to alleviate this problem. This was
followed by a whole family of ART improvements (which we want to discuss
briefly, too).


It is the idea of unsupervised learning, whose aim is the (initially binary)
pattern recognition, or more precisely the categorization of patterns into
classes. But additionally an ART network shall be capable of finding new
classes.


11.1 Task and structure of an 
ART network 


An ART network comprises exactly two layers: the input layer I and the
recognition layer O, with the input layer being completely linked towards
the recognition layer. This complete link induces a top-down weight matrix W
that contains the weight values of the connections between each neuron in
the input layer and each neuron in the recognition layer (fig. 11.1 on the
following page).

Figure 11.1: Simplified illustration of the ART network structure. Top: the input layer, bottom:
the recognition layer. In this illustration the lateral inhibition of the recognition layer and the control
neurons are omitted.

Simple binary patterns are entered into the input layer and transferred to
the recognition layer, while the recognition layer shall return a
1-out-of-|O| encoding, i.e. it should follow the winner-takes-all scheme.
For instance, to realize this 1-out-of-|O| encoding the principle of lateral
inhibition can be used - or in the implementation the most activated neuron
can be searched. For practical reasons an IF query would suit this task
best.
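Searching the most activated neuron with an IF query might look like the following sketch. The weighted sum as activation, the weight values and the function name `recognition_output` are assumptions made for illustration; the control neurons are ignored entirely.

```python
def recognition_output(W, x):
    """Return the 1-out-of-|O| encoding for a binary input vector x.

    W[i][j] is the weight from input neuron i to recognition neuron j.
    The weighted sum as activation is an assumption of this sketch.
    """
    n_out = len(W[0])
    activations = [sum(x[i] * W[i][j] for i in range(len(x)))
                   for j in range(n_out)]
    # Search the most activated neuron with a plain IF query.
    best = 0
    for j in range(1, n_out):
        if activations[j] > activations[best]:
            best = j
    return [1 if j == best else 0 for j in range(n_out)]


# Hypothetical weights for 4 input neurons and 3 recognition neurons.
W = [[0.9, 0.1, 0.2],
     [0.8, 0.2, 0.1],
     [0.1, 0.9, 0.3],
     [0.2, 0.7, 0.4]]
encoding = recognition_output(W, [0, 0, 1, 1])
```

For the input (0, 0, 1, 1) the second recognition neuron receives the largest weighted sum, so `encoding` is [0, 1, 0].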


11.1.1 Resonance takes place by
activities being tossed and
turned

But there also exists a bottom-up weight matrix V, which propagates the
activities within the recognition layer back into the input layer. Now it is
obvious that these activities are bounced back and forth again and again, a
fact that leads us to resonance. Every activity within the input layer
causes an activity within the recognition layer, while in turn every
activity in the recognition layer causes an activity within the input layer.


In addition to the two mentioned layers, there also exist a few neurons in
an ART network that exercise control functions such as signal enhancement.
But we do not want to discuss this theory further since here only the basic
principle of the ART network should become explicit. I have only mentioned
it to explain that in spite of the recurrences, the ART network will achieve
a stable state after an input.


11.2 The learning process of
an ART network is
divided into top-down and
bottom-up learning


The trick of adaptive resonance theory is 
not only the configuration of the ART net¬ 
work but also the two-piece learning pro¬ 
cedure of the theory: On the one hand 
we train the top-down matrix W, on the 
other hand we train the bottom-up matrix 
V (fig. 11.2 on the next page). 


11.2.1 Pattern input and top-down
learning

When a pattern is entered into the network it causes - as already mentioned
- an activation at the output neurons and the strongest neuron wins. Then
the weights of the matrix W going towards the output neuron are changed such
that the output of the strongest neuron is enhanced even further, i.e. the
class affiliation of the input vector to the class of the output neuron Ω
becomes enhanced.

11.2.2 Resonance and bottom-up
learning

The training of the backward weights of the matrix V is a bit tricky: Only
the weights of the respective winner neuron are trained towards the input
layer and our current input pattern is used as teaching input. Thus, the
network is trained to enhance input vectors.

11.2.3 Adding an output neuron

Of course, it could happen that the neurons are nearly equally activated or
that several neurons are activated, i.e. that the network is indecisive. In
this case, the mechanisms of the control neurons activate a signal that adds
a new output neuron. Then the current pattern is assigned to this output
neuron and the weight sets of the new neuron are trained as usual.

Thus, the advantage of this system is not only to divide inputs into classes
and to find new classes, it can also tell us after the activation of an
output neuron what a typical representative of a class looks like - which is
a significant feature.

Often, however, the system can only moderately distinguish the patterns. The
question is when a new neuron is permitted to become active and when it
should learn. In an ART network there are different additional control
neurons which answer this question according to different mathematical rules
and which are responsible for intercepting special cases.

At the same time, one of the largest objections to an ART is the fact that
an ART network uses a special distinction of cases, similar to an IF query,
that has been forced into the mechanism of a neural network.

11.3 Extensions 

As already mentioned above, the ART networks have often been extended.

ART-2 [CG87] is extended to continuous inputs and additionally offers (in an
extension called ART-2A) enhancements of the learning speed, which results
in additional control neurons and layers.

ART-3 [CG90] improves the learning ability of ART-2 by adapting additional
biological processes such as the chemical processes within the synapses¹.


Apart from the described ones there exist 
many other extensions. 



Figure 11.2: Simplified illustration of the two-piece training of an ART
network: The trained weights are represented by solid lines. Let us assume
that a pattern has been entered into the network and that the numbers mark
the outputs. Top: We can see that Ω_2 is the winner neuron. Middle: So the
weights are trained towards the winner neuron and (below) the weights of
the winner neuron are trained towards the input layer.


¹ Because of the frequent extensions of the adaptive resonance theory,
  wagging tongues already call them "ART-n networks".

Appendix A

Excursus: Cluster analysis and regional and
online learnable fields


In Grimm’s dictionary the extinct German word "Kluster" is described by "was 
dicht und dick zusammensitzet (a thick and dense group of sth.)". In static 
cluster analysis, the formation of groups within point clouds is explored. 
Introduction of some procedures, comparison of their advantages and 
disadvantages. Discussion of an adaptive clustering method based on neural 
networks. A regional and online learnable field models from a point cloud, 
possibly with a lot of points, a comparatively small set of neurons being 

representative for the point cloud. 


As already mentioned, many problems can 
be traced back to problems in cluster 
analysis. Therefore, it is necessary to re¬ 
search procedures that examine whether 
groups (so-called clusters ) exist within 
point clouds. 

Since cluster analysis procedures need a 
notion of distance between two points, a 
metric must be defined on the space 
where these points are situated. 

We briefly want to specify what a metric 
is. 

Definition A.1 (Metric). A relation dist(x_1, x_2) defined for two objects
x_1, x_2 is referred to as metric if each of the following criteria applies:

1. dist(x_1, x_2) = 0 if and only if x_1 = x_2,

2. dist(x_1, x_2) = dist(x_2, x_1), i.e. symmetry,

3. dist(x_1, x_3) ≤ dist(x_1, x_2) + dist(x_2, x_3), i.e. the triangle
   inequality holds.

Colloquially speaking, a metric is a tool for determining distances between
points in any space. Here, the distances have to be symmetrical, and the
distance between two points may only be 0 if the two points are equal.
Additionally, the triangle inequality must apply.
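The three criteria can be tried out numerically on a handful of points. Such a check on a finite sample can only refute a candidate, never prove that it is a metric; the sample points and the helper name `satisfies_metric_criteria` are my own choices. The squared distance is included for comparison.

```python
import itertools
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.dist(a, b)


def squared(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def satisfies_metric_criteria(dist, points, tol=1e-12):
    """Check the three criteria of a metric on a finite sample of points."""
    for a, b in itertools.product(points, repeat=2):
        if (dist(a, b) == 0) != (a == b):               # criterion 1
            return False
        if abs(dist(a, b) - dist(b, a)) > tol:          # criterion 2: symmetry
            return False
    for a, b, c in itertools.product(points, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:  # criterion 3
            return False
    return True


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.5, 1.5)]
euclidean_ok = satisfies_metric_criteria(euclidean, points)
squared_ok = satisfies_metric_criteria(squared, points)
```

On this sample the Euclidean distance passes all three criteria, while the squared distance fails the triangle inequality: for the collinear points 0, 1 and 2 on the first axis it yields 4 > 1 + 1.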

Distance measures are provided by, for example, the squared distance and the
Euclidean distance, which have already been introduced. Of these, the
Euclidean distance is a metric in the strict sense, whereas the squared
distance is symmetric and vanishes only for equal points but does not
satisfy the triangle inequality in general. Based on such metrics we can
define a clustering procedure that uses a metric as distance measure.

Now we want to introduce and briefly discuss different clustering
procedures.


A.1 k-means clustering
allocates data to a
predefined number of
clusters

k-means clustering according to J. MacQueen [Mac67] is an algorithm that is
often used because of its low computation and storage complexity and which
is regarded as "inexpensive and good". The operation sequence of the k-means
clustering algorithm is the following:

1. Provide data to be examined.

2. Define k, which is the number of cluster centers.

3. Select k random vectors for the cluster centers (also referred to as
   codebook vectors).

4. Assign each data point to the nearest codebook vector¹.

5. Compute cluster centers for all clusters.

6. Set codebook vectors to new cluster centers.

7. Continue with 4 until the assignments are no longer changed.

¹ The name codebook vector was created because the often used name cluster
  vector was too unclear.


Step 2 already shows one of the great questions of the k-means algorithm:
The number k of the cluster centers has to be determined in advance. This
cannot be done by the algorithm. The problem is that it is not necessarily
known in advance how k can be determined best. Another problem is that the
procedure can become quite unstable if the codebook vectors are badly
initialized. But since the initialization is random, it is often useful to
simply restart the procedure, which is affordable because a run does not
require much computational effort. If you are fully aware of those
weaknesses, you will receive quite good results.
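The operation sequence listed above translates almost directly into code. The following is a minimal sketch in plain Python. Drawing the initial codebook vectors from the data points, the iteration limit `max_iter` and the handling of empty clusters are assumptions of mine, and the two Gaussian point clouds are made-up test data.

```python
import math
import random


def k_means(data, k, rng, max_iter=100):
    """Cluster data (a list of tuples) with the k-means procedure."""
    # Step 3: select k random data points as initial codebook vectors.
    codebook = rng.sample(data, k)
    assignment = None
    for _ in range(max_iter):
        # Step 4: assign each data point to the nearest codebook vector.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, codebook[j]))
            for p in data
        ]
        # Step 7: stop as soon as the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Steps 5 and 6: move each codebook vector to its cluster center.
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # an empty cluster keeps its old codebook vector
                codebook[j] = tuple(sum(c) / len(members)
                                    for c in zip(*members))
    return codebook, assignment


rng = random.Random(42)
data = ([(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(50)]
        + [(rng.gauss(3.0, 0.1), rng.gauss(3.0, 0.1)) for _ in range(50)])
codebook, assignment = k_means(data, k=2, rng=rng)
```

Restarting the procedure, as recommended above, simply means calling `k_means` again with a differently seeded random number generator and keeping the better result.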

However, complex structures such as "clus¬ 
ters in clusters" cannot be recognized. If k 
is high, the outer ring of the construction 
in the following illustration will be recog¬ 
nized as many single clusters. If k is low, 
the ring with the small inner clusters will 
be recognized as one cluster. 


For an illustration see the upper right part of fig. A.1 on page 174.


A.2 k-nearest neighboring 
looks for the k nearest 
neighbors of each data 
point 


The k-nearest neighboring procedure [CH67] connects each data point to the k
closest neighbors, which often results in a division of the groups. Then
such a group builds a cluster. The advantage is that the number of clusters
occurs all by itself. The disadvantage is that a large storage and
computational effort is required to find the nearest neighbors (the
distances between all data points must be computed and stored).

There are some special cases in which the procedure combines data points
belonging to different clusters if k is too high (see the two small clusters
in the upper right of the illustration). Clusters consisting of only one
single data point are basically connected to another cluster, which is not
always intentional.

Furthermore, it is not mandatory that the links between the points are
symmetric.

But this procedure allows a recognition of rings and therefore of "clusters
in clusters", which is a clear advantage. Another advantage is that the
procedure adaptively responds to the distances in and between the clusters.

For an illustration see the lower left part of fig. A.1.


A.3 ε-nearest neighboring
looks for neighbors within
the radius ε for each
data point

Another approach of neighboring: here, the neighborhood detection does not
use a fixed number k of neighbors but a radius ε, which is the reason for
the name epsilon-nearest neighboring. Points are neighbors if they are at
most ε apart from each other. Here, the storage and computational effort is
obviously very high, which is a disadvantage.

But note that there are some special cases: Two separate clusters can easily
be connected due to the unfavorable situation of a single data point. This
can also happen with k-nearest neighboring, but it would be more difficult
since in this case the number of neighbors per point is limited.

An advantage is the symmetric nature of the neighborhood relationships.
Another advantage is that the combination of minimal clusters due to a fixed
number of neighbors is avoided.

On the other hand, it is necessary to skillfully initialize ε in order to be
successful, i.e. smaller than half the smallest distance between two
clusters. With variable cluster and point distances within clusters this can
possibly be a problem.

For an illustration see the lower right part of fig. A.1.
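Both neighboring procedures can be sketched with the same helper: first the links between points are determined, then all points connected by links form one cluster. The union-find helper, the function names and the six sample points are assumptions chosen for illustration.

```python
import math


def connected_groups(n, links):
    """Group the indices 0..n-1 into clusters connected by the given links."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in links:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def k_nearest_links(points, k):
    """Link every point to its k closest neighbors (links may be one-sided)."""
    links = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        links.extend((i, j) for j in others[:k])
    return links


def epsilon_links(points, eps):
    """Link all pairs of points that are at most eps apart (symmetric)."""
    return [(i, j)
            for i in range(len(points)) for j in range(i + 1, len(points))
            if math.dist(points[i], points[j]) <= eps]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # first group
          (2.0, 2.0), (2.1, 2.0),               # second group
          (5.0, 5.0)]                           # single point
knn_clusters = connected_groups(len(points), k_nearest_links(points, 1))
eps_clusters = connected_groups(len(points), epsilon_links(points, 0.5))
```

With k = 1 the single point is attached to the second group, which illustrates the special case mentioned for k-nearest neighboring, whereas ε = 0.5 keeps it as a cluster of its own. Raising k to 2 already merges all six points into one cluster.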


A.4 The silhouette coefficient 
determines how accurate 
a given clustering is 

As we can see above, there is no easy answer for clustering problems. Each
procedure described has very specific disadvantages. In this respect it is
useful to have a criterion to decide how good our cluster division is. This
possibility is offered by the silhouette coefficient according to [Kau90].
This coefficient measures how well the clusters are delimited from each
other and indicates if points may be assigned to the wrong clusters.

Figure A.1: Top left: our set of points. We will use this set to explore the different clustering
methods. Top right: k-means clustering. Using this procedure we chose k = 6. As we can
see, the procedure is not capable of recognizing "clusters in clusters" (bottom left of the illustration).
Long "lines" of points are a problem, too: They would be recognized as many small clusters (if k
is sufficiently large). Bottom left: k-nearest neighboring. If k is selected too high (higher than
the number of points in the smallest cluster), this will result in cluster combinations shown in the
upper right of the illustration. Bottom right: ε-nearest neighboring. This procedure will cause
difficulties if ε is selected larger than the minimum distance between two clusters (see upper left of
the illustration), which will then be combined.

Let P be a point cloud and p a point in P. Let c ⊆ P be a cluster within the
point cloud and p be part of this cluster, i.e. p ∈ c. The set of clusters
is called C. Summary:

    p ∈ c ⊆ P

applies.


To calculate the silhouette coefficient, we initially need the average
distance between point p and all its cluster neighbors. This variable is
referred to as a(p) and defined as follows:

    a(p) = \frac{1}{|c| - 1} \sum_{q \in c,\, q \neq p} \mathrm{dist}(p, q)    (A.1)

Furthermore, let b(p) be the average distance between our point p and all
points of the next cluster (g represents all clusters except for c):

    b(p) = \min_{g \in C,\, g \neq c} \frac{1}{|g|} \sum_{q \in g} \mathrm{dist}(p, q)    (A.2)

The point p is classified well if the distance to the center of the own
cluster is minimal and the distance to the centers of the other clusters is
maximal. In this case, the following term provides a value close to 1:

    s(p) = \frac{b(p) - a(p)}{\max\{a(p), b(p)\}}    (A.3)

Apparently, the whole term s(p) can only be within the interval [-1; 1]. A
value close to -1 indicates a bad classification of p.

The silhouette coefficient S(P) results from the average of all values s(p):

    S(P) = \frac{1}{|P|} \sum_{p \in P} s(p).    (A.4)

As above the total quality of the cluster division is expressed by the
interval [-1; 1].
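The silhouette coefficient can be computed directly from its definition. The following sketch uses the Euclidean distance and assumes at least two clusters with pairwise distinct points; treating a(p) as 0 for a point that is alone in its cluster is a convention of mine, and the two example divisions are made up.

```python
import math


def silhouette(clusters):
    """Silhouette coefficient S(P) for a list of clusters (lists of points).

    Assumes at least two clusters and pairwise distinct points. A point
    that is alone in its cluster has no cluster neighbors, so its a(p) is
    set to 0 here (an assumption of this sketch).
    """
    values = []
    for ci, c in enumerate(clusters):
        for pi, p in enumerate(c):
            # a(p): average distance to all other points of the own cluster.
            if len(c) > 1:
                a = sum(math.dist(p, q)
                        for qi, q in enumerate(c) if qi != pi) / (len(c) - 1)
            else:
                a = 0.0
            # b(p): smallest average distance to the points of another cluster.
            b = min(sum(math.dist(p, q) for q in g) / len(g)
                    for gi, g in enumerate(clusters) if gi != ci)
            values.append((b - a) / max(a, b))   # s(p)
    return sum(values) / len(values)             # S(P)


# The same four points, once divided sensibly and once badly.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
s_good = silhouette(good)
s_bad = silhouette(bad)
```

The sensible division yields a coefficient of about 0.9, the bad one a clearly negative value of about -0.45.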


As different clustering strategies with different characteristics have been
presented now (lots of further material is presented in [DHS01]), as well as
a measure to indicate the quality of an existing arrangement of given data
into clusters, I want to introduce a clustering method based on an
unsupervised learning neural network [SGE05] which was published in 2005.
Like all the other methods this one may not be perfect, but it eliminates
large standard weaknesses of the known clustering methods.


A.5 Regional and online 
learnable fields are a 
neural clustering strategy 


The paradigm of neural networks which I want to introduce now is that of the
regional and online learnable fields, shortly referred to as ROLFs.

A.5.1 ROLFs try to cover data with
neurons

Roughly speaking, the regional and online learnable fields are a set K of
neurons which try to cover a set of points as well as possible by means of
their distribution in the input space. For this, neurons are added, moved or
changed in their size during training if necessary. The parameters of the
individual neurons will be discussed later.

Definition A.2 (Regional and online learnable field). A regional and online
learnable field (abbreviated ROLF or ROLF network) is a set K of neurons
that are trained to cover a certain set in the input space as well as
possible.



Figure A.2: Structure of a ROLF neuron.


A.5.1.1 ROLF neurons feature a
position and a radius in the
input space

Here, a ROLF neuron k ∈ K has two parameters: Similar to the RBF networks,
it has a center c_k, i.e. a position in the input space.

But it has yet another parameter: The radius σ, which defines the radius of
the perceptive surface surrounding the neuron². A neuron covers the part of
the input space that is situated within this radius.

c_k and σ_k are locally defined for each neuron. This particularly means
that the neurons are capable of covering surfaces of different sizes.

The radius of the perceptive surface is specified by r = ρ · σ (fig. A.2)
with the multiplier ρ being globally defined and previously specified for
all neurons. Intuitively, the reader will wonder what this multiplier is
used for. Its significance will be discussed later. Furthermore, the
following has to be observed: It is not necessary for the perceptive
surfaces of the different neurons to be of the same size.


Definition A.3 (ROLF neuron). The parameters of a ROLF neuron k are a center
c_k and a radius σ_k.

Definition A.4 (Perceptive surface). The perceptive surface of a ROLF neuron
k consists of all points within the radius ρ · σ in the input space.

² I write "defines" and not "is" because the actual radius is specified by
  σ · ρ.

A.5.2 A ROLF learns unsupervised 
by presenting training 
samples online 

Like many other paradigms of neural net¬ 
works our ROLF network learns by receiv¬ 
ing many training samples p of a training 
set P. The learning is unsupervised. For 
each training sample p entered into the net¬ 
work two cases can occur: 

1. There is one accepting neuron k for p or

2. there is no accepting neuron at all.

If in the first case several neurons are suitable, then there will be
exactly one accepting neuron insofar as the closest neuron is the accepting
one. For the accepting neuron k, c_k and σ_k are adapted.

Definition A.5 (Accepting neuron). The criterion for a ROLF neuron k to be
an accepting neuron of a point p is that the point p must be located within
the perceptive surface of k. If p is located in the perceptive surfaces of
several neurons, then the closest neuron will be the accepting one. If there
are several closest neurons, one can be chosen randomly.


A.5.2.1 Both positions and radii are
adapted throughout learning

Let us assume that we entered a training sample p into the network and that
there is an accepting neuron k. Then the radius moves towards ||p - c_k||
(i.e. towards the distance between p and c_k) and the center c_k towards p.
Additionally, let us define the two learning rates η_σ and η_c for radii and
centers.

    c_k(t + 1) = c_k(t) + \eta_c (p - c_k(t))

    \sigma_k(t + 1) = \sigma_k(t) + \eta_\sigma (\|p - c_k(t)\| - \sigma_k(t))

Note that here σ_k is a scalar while c_k is a vector in the input space.

Definition A.6 (Adapting a ROLF neuron). A neuron k that accepts a point p
is adapted according to the following rules:

    c_k(t + 1) = c_k(t) + \eta_c (p - c_k(t))    (A.5)

    \sigma_k(t + 1) = \sigma_k(t) + \eta_\sigma (\|p - c_k(t)\| - \sigma_k(t))    (A.6)


A.5.2.2 The radius multiplier allows neurons to be able not only to shrink

Now we can understand the function of the multiplier $\rho$: Due to this
multiplier the perceptive surface of a neuron includes more than only all
points surrounding the neuron in the radius $\sigma$. An accepted sample
can therefore lie farther away from the center than $\sigma$, and in that
case the learning rule pulls $\sigma$ upwards. This means that due to the
aforementioned learning rule $\sigma$ can not only decrease but also
increase.

Definition A.7 (Radius multiplier). The radius multiplier $\rho > 1$ is
globally defined and expands the perceptive surface of a neuron $k$ to a
multiple of $\sigma_k$. So it is ensured that the radius $\sigma_k$ can not
only decrease but also increase.

Generally, the radius multiplier is set to values in the lower one-digit
range, such as 2 or 3.

So far we have only discussed the case in the ROLF training that there is
an accepting neuron for the training sample $p$.

A.5.2.3 As required, new neurons are generated

This leaves the case that there is no accepting neuron.

In this case a new accepting neuron $k$ is generated for our training
sample. As a result, $c_k$ and $\sigma_k$ have to be initialized.

The initialization of $c_k$ can be understood intuitively: The center of
the new neuron is simply set on the training sample, i.e.

$$c_k = p.$$

We generate a new neuron because there is no neuron close to $p$; for
logical reasons, we place the neuron exactly on $p$.

But how do we set $\sigma$ when a new neuron is generated? For this
purpose there exist different options:

Init-$\sigma$: We always select a predefined static $\sigma$.

Minimum $\sigma$: We take a look at the $\sigma$ of each neuron and select
the minimum.

Maximum $\sigma$: We take a look at the $\sigma$ of each neuron and select
the maximum.

Mean $\sigma$: We select the mean $\sigma$ of all neurons.

Currently, the mean-$\sigma$ variant is the favorite one although the
learning procedure also works with the other ones. In the
minimum-$\sigma$ variant the neurons tend to cover less of the surface, in
the maximum-$\sigma$ variant they tend to cover more of the surface.

Definition A.8 (Generating a ROLF neuron). If a new ROLF neuron $k$ is
generated by entering a training sample $p$, then $c_k$ is initialized
with $p$ and $\sigma_k$ according to one of the aforementioned strategies
(init-$\sigma$, minimum-$\sigma$, maximum-$\sigma$, mean-$\sigma$).

The training is complete when, after repeated randomly permuted pattern
presentation, no new neuron has been generated in an epoch and the
positions of the neurons barely change.
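The learning procedure of this section can be condensed into a few lines of Python. The following is only a minimal sketch under several assumptions that are not part of the text: the class name `Rolf` and its method names are hypothetical, new neurons use the mean-$\sigma$ strategy, the very first neuron receives a freely chosen `init_sigma`, ties between equally close neurons are resolved by taking the first one instead of a random one, and training stops as soon as one epoch generates no new neuron (the additional criterion that the positions barely change is omitted).

```python
import math
import random


class Rolf:
    """Minimal ROLF sketch; every neuron is stored as [center, sigma]."""

    def __init__(self, rho=2.0, eta_c=0.05, eta_sigma=0.05, init_sigma=0.5):
        self.rho = rho                # global radius multiplier (rho > 1)
        self.eta_c = eta_c            # learning rate for the centers
        self.eta_sigma = eta_sigma    # learning rate for the radii
        self.init_sigma = init_sigma  # assumption: used only for the first neuron
        self.neurons = []

    def accepting_neuron(self, p):
        """Index of the closest neuron whose perceptive surface contains p."""
        best, best_dist = None, math.inf
        for i, (c, sigma) in enumerate(self.neurons):
            d = math.dist(p, c)
            if d <= self.rho * sigma and d < best_dist:
                best, best_dist = i, d
        return best  # None if no neuron accepts p

    def present(self, p):
        """Present one sample; return True if a new neuron was generated."""
        k = self.accepting_neuron(p)
        if k is None:
            if self.neurons:  # mean-sigma strategy
                sigma = sum(s for _, s in self.neurons) / len(self.neurons)
            else:
                sigma = self.init_sigma
            self.neurons.append([list(p), sigma])
            return True
        c, sigma = self.neurons[k]
        d = math.dist(p, c)  # distance to the old center, as in (A.6)
        self.neurons[k][0] = [ci + self.eta_c * (pi - ci) for ci, pi in zip(c, p)]
        self.neurons[k][1] = sigma + self.eta_sigma * (d - sigma)
        return False


def train(rolf, samples, max_epochs=100, seed=0):
    """Present randomly permuted samples until an epoch creates no neuron."""
    rng = random.Random(seed)
    samples = list(samples)
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(samples)
        created = sum(rolf.present(p) for p in samples)
        if created == 0:  # simplified stop criterion (assumption)
            return epoch
    return max_epochs
```

Calling `present` with a sample that lies between $\sigma_k$ and $\rho \cdot \sigma_k$ away from the accepting center increases that neuron's radius, which is exactly the effect the radius multiplier is meant to allow.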

A.5.3 Evaluating a ROLF 

The result of the training algorithm is that the training set is gradually
covered well and precisely by the ROLF neurons and that a high
concentration of points on a spot of the input space does not
automatically generate more neurons. Thus, a possibly very large point
cloud is reduced to very few representatives (based on the input set).

Then it is very easy to define the number of clusters: Two neurons are
(according to the definition of the ROLF) connected when their perceptive
surfaces overlap (i.e. some kind of nearest neighboring is executed with
the variable perceptive surfaces). A cluster is a group of connected
neurons or a group of points of the input space covered by these neurons
(fig. A.3).
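Counting clusters as groups of connected neurons can be sketched as follows. The sketch assumes that two perceptive surfaces overlap when the distance between the two centers is smaller than the sum of the perceptive radii $\rho \sigma_i + \rho \sigma_j$, which is one natural reading of "overlap" for spherical surfaces. The function name `rolf_clusters` and the union-find bookkeeping are illustrative choices, not taken from the ROLF publication.

```python
import math


def rolf_clusters(neurons, rho):
    """Group neurons into clusters of connected neurons.

    neurons: list of (center, sigma) pairs; rho: global radius multiplier.
    Returns a sorted list of clusters, each a list of neuron indices.
    """
    n = len(neurons)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            (ci, si), (cj, sj) = neurons[i], neurons[j]
            # Assumed overlap test: center distance below the sum of radii.
            if math.dist(ci, cj) < rho * si + rho * sj:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical example: the first two neurons overlap, the third is isolated.
example = [((0.0, 0.0), 0.5), ((1.5, 0.0), 0.5), ((10.0, 0.0), 0.5)]
print(rolf_clusters(example, rho=2.0))  # prints [[0, 1], [2]]
```

The pairwise loop is quadratic in the number of neurons, not in $|P|$, which is acceptable because the neurons are few compared to the original data points.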

Of course, the complete ROLF network can be evaluated by means of other
clustering methods, i.e. the neurons can be searched for clusters.
Particularly with clustering methods whose storage effort grows
quadratically with $|P|$, the storage effort can be reduced dramatically
since generally there are considerably fewer ROLF neurons than original
data points, but the neurons represent the data points quite well.




A.5.4 Comparison with popular 
clustering methods 

It is obvious that storing the neurons rather than storing the input
points takes the biggest part of the storage effort of the ROLFs. This is
a great advantage for huge point clouds with a lot of points.

Since it is unnecessary to store the entire point cloud, our ROLF, as a
neural clustering method, has the capability to learn online, which is
definitely a great advantage. Furthermore, it can (similar to
$\varepsilon$-nearest neighboring or $k$-nearest neighboring) distinguish
clusters from enclosed clusters - but due to the online presentation of
the data without a quadratically growing storage effort, which is by far
the greatest disadvantage of the two neighboring methods.

Figure A.3: The clustering process. Top: the input set, middle: the input
space covered by ROLF neurons, bottom: the input space only covered by the
neurons (representatives).

Additionally, the issue of the size of the individual clusters
proportional to their distance from each other is addressed by using
variable perceptive surfaces - which is also not always the case for the
two mentioned methods.

The ROLF compares favorably with $k$-means clustering, as well: Firstly,
it is unnecessary to previously know the number of clusters and,
secondly, $k$-means clustering cannot recognize clusters enclosed by other
clusters as separate clusters.

A.5.5 Initializing radii, learning 
rates and multiplier is not 
trivial 

Certainly, the disadvantages of the ROLF shall not be concealed: It is not
always easy to select the appropriate initial value for $\sigma$ and
$\rho$. The previous knowledge about the data set can, so to speak, be
included in $\rho$ and the initial value of $\sigma$ of the ROLF:
Fine-grained data clusters should use a small $\rho$ and a small initial
value of $\sigma$. But the smaller the $\rho$, the smaller the chance that
the neurons will grow if necessary. Here again, there is no easy answer,
just like for the learning rates $\eta_c$ and $\eta_\sigma$.

For $\rho$, multipliers in the lower single-digit range such as 2 or 3 are
very popular. $\eta_c$ and $\eta_\sigma$ work successfully with values of
about 0.005 to 0.1; variations during run-time are also imaginable for
this type of network. Initial values for $\sigma$ generally depend on the
cluster and data distribution (i.e. they often have to be tested). But, at
least with the mean-$\sigma$ strategy, ROLFs are relatively robust against
wrong initializations after some training time.

As a whole, the ROLF is on a par with the other clustering methods and is
particularly interesting for systems with low storage capacity or huge
data sets.

A.5.6 Application examples 

A first application example could be finding color clusters in RGB images.
Another field of application directly described in the ROLF publication is
the recognition of words transferred into a 720-dimensional feature space.
Thus, we can see that ROLFs are relatively robust against higher
dimensions. Further applications can be found in the field of analysis of
attacks on network systems and their classification.

Exercises 

Exercise 18. Determine at least four adaptation steps for one single ROLF
neuron $k$ if the four patterns stated below are presented one after
another in the indicated order. Let the initial values for the ROLF neuron
be $c_k = (0.1, 0.1)$ and $\sigma_k = 1$. Furthermore, let $\eta_c = 0.5$
and $\eta_\sigma = 0$. Let $\rho = 3$.

$$P = \{(0.1, 0.1);\ (0.9, 0.1);\ (0.1, 0.9);\ (0.9, 0.9)\}.$$

Appendix B

Excursus: neural networks used for prediction

Discussion of an application of neural networks: a look ahead into the
future of time series.


After discussing the different paradigms of neural networks it is now
useful to take a look at an application of neural networks which is
brought up often and (as we will see) is also used for fraud: The
application of time series prediction. This excursus is structured into
the description of time series and estimations about the requirements that
are actually needed to predict the values of a time series. Finally, I
will say something about the range of software which should predict share
prices or other economic characteristics by means of neural networks or
other procedures.

This chapter should not be a detailed description but rather indicate some
approaches for time series prediction. In this respect I will again try to
avoid formal definitions.


B.1 About time series

A time series is a series of values discretized in time. For example,
daily measured temperature values or other meteorological data of a
specific site could be represented by a time series. Share price values
also represent a time series. Often the measurements of a time series are
equidistant in time, and in many time series the future development of
their values is very interesting, e.g. the daily weather forecast.

Time series can also be values of an actually continuous function read at
a certain time interval $\Delta t$ (fig. B.1).

If we want to predict a time series, we will look for a neural network
that maps the previous series values to future developments of the time
series, i.e. if we know longer sections of the time series, we will have
enough training samples. Of course, these are not examples for the future
to be predicted but it is tried to generalize and to extrapolate the past
by means of the said samples.

Figure B.1: A function $x$ that depends on the time is sampled at discrete
time steps (time discretized), this means that the result is a time
series. The sampled values are entered into a neural network (in this
example an SLP) which shall learn to predict the future values of the time
series.

But before we begin to predict a time series we have to answer some
questions about the time series we are dealing with and ensure that it
fulfills some requirements.

1. Do we have any evidence which suggests that future values depend in any
   way on the past values of the time series? Does the past of a time
   series include information about its future?

2. Do we have enough past values of the time series that can be used as
   training patterns?

3. In case of a prediction of a continuous function: What must a useful
   $\Delta t$ look like?

Now these questions shall be explored in detail.

How much information about the future is included in the past values of a
time series? This is the most important question to be answered for any
time series that should be mapped into the future. If the future values of
a time series, for instance, do not depend on the past values, then a time
series prediction based on them will be impossible.

In this chapter, we assume systems whose future values can be deduced from
their states - the deterministic systems. This leads us to the question of
what a system state is.

A system state completely describes a system for a certain point of time.
The future of a deterministic system would be clearly defined by means of
the complete description of its current state.

The problem in the real world is that such a state concept includes all
things that influence our system by any means.

In case of our weather forecast for a specific site we could definitely
determine the temperature, the atmospheric pressure and the cloud density
as the meteorological state of the place at a time $t$. But the whole
state would include significantly more information. Here, the worldwide
phenomena that control the weather would be interesting as well as small
local phenomena such as the cooling system of the local power plant.

So we shall note that the system state is desirable for prediction but not
always possible to obtain. Often only fragments of the current states can
be acquired, e.g. for a weather forecast these fragments are the said
weather data.

However, we can partially overcome these weaknesses by using not only one
single state (the last one) for the prediction, but by using several past
states. From this we want to derive our first prediction system:

B.2 One-step-ahead prediction

The first attempt to predict the next future value of a time series out of
past values is called one-step-ahead prediction (fig. B.2).

Such a predictor system receives the last $n$ observed state parts of the
system as input and outputs the prediction for the next state (or state
part). The idea of a state space with predictable states is called state
space forecasting.

The aim of the predictor is to realize a function

$$f(x_{t-n+1}, \ldots, x_{t-1}, x_t) = \tilde{x}_{t+1}, \tag{B.1}$$

which receives exactly $n$ past values in order to predict the future
value. Predicted values shall be headed by a tilde (e.g. $\tilde{x}$) to
distinguish them from the actual future values.

The most intuitive and simplest approach would be to find a linear
combination

$$\tilde{x}_{i+1} = a_0 x_i + a_1 x_{i-1} + \ldots + a_j x_{i-j} \tag{B.2}$$

that approximately fulfills our conditions.
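To make the predictor function of (B.1) and the linear combination of (B.2) more tangible, here is a minimal sketch that cuts a time series into input windows of $n$ past values with the following value as target, and evaluates a linear combination on such a window. The function names `windows` and `linear_predict` are hypothetical, and the coefficients in the example are chosen by hand for a series that happens to be exactly linear.

```python
def windows(series, n):
    """Yield (inputs, target) pairs for one-step-ahead prediction.

    inputs = [x_t, x_{t-1}, ..., x_{t-n+1}] (newest value first),
    target = x_{t+1}.
    """
    for t in range(n - 1, len(series) - 1):
        inputs = [series[t - k] for k in range(n)]
        yield inputs, series[t + 1]


def linear_predict(coeffs, inputs):
    """Evaluate a_0 * x_t + a_1 * x_{t-1} + ... as in (B.2)."""
    return sum(a * x for a, x in zip(coeffs, inputs))


# Hypothetical example: for an arithmetic series x_{t+1} = 2 x_t - x_{t-1}.
series = [1, 2, 3, 4, 5]
for inputs, target in windows(series, n=2):
    print(inputs, target, linear_predict([2, -1], inputs))
# prints:
# [2, 1] 3 3
# [3, 2] 4 4
# [4, 3] 5 5
```

Every pair produced by `windows` is one training sample for a predictor, so a series of length $L$ yields $L - n$ samples.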

Such a construction is called digital filter. Here we use the fact that
time series usually have a lot of past values so that we can set up a
series of equations¹:

$$\begin{aligned}
x_t &= a_0 x_{t-1} + \ldots + a_j x_{t-1-(n-1)} \\
x_{t-1} &= a_0 x_{t-2} + \ldots + a_j x_{t-2-(n-1)} \\
&\;\;\vdots \\
x_{t-n} &= a_0 x_{t-n-1} + \ldots + a_j x_{t-n-1-(n-1)}
\end{aligned} \tag{B.3}$$

Thus, $n$ equations could be found for $n$ unknown coefficients and solved
(if possible). Or another, better approach: we could use $m > n$ equations
for $n$ unknowns in such a way that the sum of the mean squared errors of
the already known prediction is minimized. This is called moving average
procedure.

But this linear structure corresponds to a singlelayer perceptron with a
linear activation function which has been trained by means of data from
the past (the experimental setup would comply with fig. B.1). In fact, the
training by means of the delta rule provides results very close to the
analytical solution.

Figure B.2: Representation of the one-step-ahead prediction. It is tried
to calculate the future value from a series of past values. The predicting
element (in this case a neural network) is referred to as predictor.

¹ Without going into detail, I want to remark that the prediction becomes
easier the more past values of the time series are available. I would like
to ask the reader to read up on the Nyquist-Shannon sampling theorem.
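The remark that delta-rule training comes very close to the analytical solution can be tried out on a small example. The sketch below trains the coefficients of a linear predictor (a singlelayer perceptron with linear activation) by the delta rule. The function name `fit_delta_rule`, the learning rate and the number of epochs are assumptions chosen for this illustration. The test series is a sampled sine, for which the identity $\sin(\omega(t+1)) = 2\cos(\omega)\sin(\omega t) - \sin(\omega(t-1))$ gives the exact coefficients $a_0 = 2\cos\omega$ and $a_1 = -1$.

```python
import math


def fit_delta_rule(series, n, lr=0.1, epochs=200):
    """Train a_0..a_{n-1} of a linear one-step-ahead predictor online."""
    coeffs = [0.0] * n
    for _ in range(epochs):
        for t in range(n - 1, len(series) - 1):
            inputs = [series[t - k] for k in range(n)]  # x_t, x_{t-1}, ...
            prediction = sum(a * x for a, x in zip(coeffs, inputs))
            error = series[t + 1] - prediction
            # Delta rule: move each weight along error times its input.
            coeffs = [a + lr * error * x for a, x in zip(coeffs, inputs)]
    return coeffs


omega = 0.5
series = [math.sin(omega * t) for t in range(200)]
a0, a1 = fit_delta_rule(series, n=2)
print(a0, 2 * math.cos(omega))  # learned vs. analytical a_0
print(a1, -1.0)                 # learned vs. analytical a_1
```

On this noise-free series the learned values approach the analytical ones. With noisy data the delta rule would only fluctuate around the least-squares solution unless the learning rate is reduced over time.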


Even if this approach often provides satisfying results, we have seen that
many problems cannot be solved by using a singlelayer perceptron.
Additional layers with linear activation function are useless, as well,
since a multilayer perceptron with only linear activation functions can be
reduced to a singlelayer perceptron. Such considerations lead to a
non-linear approach.

The multilayer perceptron and non-linear activation functions provide a
universal non-linear function approximator, i.e. we can use an
$n$-$|H|$-1-MLP for $n$ inputs out of the past. An RBF network could also
be used. But remember that here the number $n$ has to remain low since in
RBF networks high input dimensions are very complex to realize. So if we
want to include many past values, a multilayer perceptron will require
considerably less computational effort.

B.3 Two-step-ahead prediction

What approaches can we use to see farther into the future?

B.3.1 Recursive two-step-ahead prediction

In order to extend the prediction to, for instance, two time steps into
the future, we could perform two one-step-ahead predictions in a row
(fig. B.3), i.e. a recursive two-step-ahead prediction. Unfortunately, the
value determined by means of a one-step-ahead prediction is generally
imprecise so that errors can build up, and the more predictions are
performed in a row the more imprecise the result becomes.

B.3.2 Direct two-step-ahead prediction

We have already guessed that there exists a better approach: Just like the
system can be trained to predict the next value, we can certainly train it
to predict the next but one value. This means we directly train, for
example, a neural network to look two time steps ahead into the future,
which is referred to as direct two-step-ahead prediction (fig. B.4).
Obviously, the direct two-step-ahead prediction is technically identical
to the one-step-ahead prediction. The only difference is the training.

B.4 Additional optimization approaches for prediction


The possibility to predict values far away in the future is not only
important because we try to look farther ahead into the future. There can
also be periodic time series where other approaches are hardly possible:
If a lecture begins at 9 a.m. every Thursday, it is not very useful to
know how many people sat in the lecture room on Monday to predict the
number of lecture participants. The same applies, for example, to
periodically occurring commuter jams.

B.4.1 Changing temporal parameters

Thus, it can be useful to intentionally leave gaps in the future values as
well as in the past values of the time series, i.e. to introduce the
parameter $\Delta t$ which indicates which past value is used for
prediction. Technically speaking, we still use a one-step-ahead
prediction, only that we extend the input space or train the system to
predict values lying farther away.

It is also possible to combine different $\Delta t$: In case of the
traffic jam prediction for a Monday the values of the last few days could
be used as data input in addition to the values of the previous Mondays.
Thus, we use the last values of several periods, in this case the values
of a weekly and a daily period. We could also include an annual period in
the form of the beginning of the holidays (for sure, every one of us has
already spent a lot of time on the highway because we forgot the beginning
of the holidays).

Figure B.3: Representation of the two-step-ahead prediction. Attempt to
predict the second future value out of a past value series by means of a
second predictor and the involvement of an already predicted value.

Figure B.4: Representation of the direct two-step-ahead prediction. Here,
the second time step is predicted directly, the first one is omitted.
Technically, it does not differ from a one-step-ahead prediction.
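Leaving gaps in the past values and predicting values lying farther away both come down to how the training samples are cut out of the series. The following sketch shows one way to do this. The function name `lagged_samples` and its parameters `lags` and `horizon` are hypothetical names introduced for this illustration: `lags` lists how far back each input value lies (0 is the current value), and `horizon` says how many steps ahead the target lies, so that `horizon=2` corresponds to the direct two-step-ahead prediction.

```python
def lagged_samples(series, lags, horizon=1):
    """Build (inputs, target) training pairs from a time series.

    lags:    offsets into the past, e.g. [0, 1, 7] combines a daily
             and a weekly period on daily data.
    horizon: number of steps between the current value and the target.
    """
    max_lag = max(lags)
    samples = []
    for t in range(max_lag, len(series) - horizon):
        inputs = [series[t - lag] for lag in lags]
        samples.append((inputs, series[t + horizon]))
    return samples


# Hypothetical daily data: today, yesterday and the same weekday last week
# are used to predict the value two days ahead.
daily = list(range(10))
print(lagged_samples(daily, lags=[0, 1, 7], horizon=2))
# prints [([7, 6, 0], 9)]
```

The predictor itself stays unchanged. Only the samples differ, which mirrors the remark that technically we still perform a one-step-ahead prediction.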

B.4.2 Heterogeneous prediction

Another prediction approach would be to predict the future values of a
single time series out of several time series, if it is assumed that the
additional time series is related to the future of the first one
(heterogeneous one-step-ahead prediction, fig. B.5).

If we want to predict two outputs of two related time series, it is
certainly possible to perform two parallel one-step-ahead predictions
(analytically this is done very often because otherwise the equations
would become very confusing); or in case of the neural networks an
additional output neuron is attached and the knowledge of both time series
is used for both outputs (fig. B.6).

You'll find more, and more general, material on time series in [WG94].

B.5 Remarks on the prediction of share prices

Many people observe the changes of a share price in the past and try to
conclude the future from those values in order to benefit from this
knowledge. Share prices are discontinuous and therefore they are
principally difficult functions. Furthermore, the functions can only be
used for discrete values - often, for example, in a daily rhythm
(including the maximum and minimum values per day, if we are lucky) with
the daily variations certainly being eliminated. But this makes the whole
thing even more difficult.

There are chartists, i.e. people who look at many diagrams and decide by
means of a lot of background knowledge and decade-long experience whether
the equities should be bought or not (and often they are very successful).

Apart from the share prices it is very interesting to predict the exchange
rates of currencies: If we exchange 100 Euros into Dollars, the Dollars
into Pounds and the Pounds back into Euros it could be possible that we
will finally receive 110 Euros. But once found out, we would do this more
often and thus we would change the exchange rates into a state in which
such an increasing circulation would no longer be possible (otherwise we
could produce money by generating, so to speak, a financial perpetual
motion machine).

At the stock exchange, successful stock and currency brokers raise or
lower their thumbs - and thereby indicate whether in their opinion a share
price or an exchange rate will increase or decrease. Mathematically
speaking, they indicate the first bit (sign) of the first derivative of
the exchange rate. In that way excellent world-class brokers obtain
success rates of about 70%.

In Great Britain, the heterogeneous one-step-ahead prediction was
successfully used to increase the accuracy of such predictions to 76%: In
addition to the time series of the values, indicators such as the oil
price in Rotterdam or the US national debt were included.

Figure B.5: Representation of the heterogeneous one-step-ahead prediction.
Prediction of a time series under consideration of a second one.

Figure B.6: Heterogeneous one-step-ahead prediction of two time series at
the same time.

This is just an example to show the magnitude of the accuracy of
stock-exchange evaluations, since we are still talking only about the
first bit of the first derivative! We still do not know how strong the
expected increase or decrease will be and also whether the effort will pay
off: Probably, one wrong prediction could nullify the profit of one
hundred correct predictions.
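That a high success rate on the "first bit" says little about the profit can be illustrated with a toy calculation. The prices below are invented for this purpose, and the function names `direction_hit_rate` and `profit` are hypothetical. The profit model is a deliberately crude assumption: one share is held whenever a rise is predicted, and the position is reversed whenever a fall is predicted.

```python
def direction_hit_rate(prices, predicted_up):
    """Fraction of steps whose direction (up or not up) was predicted correctly.

    predicted_up[i] is the forecast for the move from prices[i] to prices[i + 1].
    """
    steps = len(prices) - 1
    hits = sum((prices[i + 1] > prices[i]) == predicted_up[i] for i in range(steps))
    return hits / steps


def profit(prices, predicted_up):
    """Sum of price changes, counted positively when the direction was right."""
    total = 0.0
    for i in range(len(prices) - 1):
        change = prices[i + 1] - prices[i]
        total += change if predicted_up[i] else -change
    return total


# Invented data: three small gains followed by one large drop.
prices = [100, 101, 102, 103, 90]
forecast = [True, True, True, True]  # "thumbs up" every day
print(direction_hit_rate(prices, forecast))  # prints 0.75
print(profit(prices, forecast))              # prints -10.0
```

Three out of four directions are predicted correctly here, and the overall result is still a loss, because the sign alone does not carry the magnitude of the change.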


Again and again some software appears which uses scientific key words such
as "neural networks" to purport that it is capable of predicting where
share prices are going. Do not buy such software! In addition to the
aforementioned scientific exclusions there is one simple reason for this:
If these tools work - why should the manufacturer sell them? Normally,
useful economic knowledge is kept secret. If we knew a way to definitely
gain wealth by means of shares, we would earn our millions by using this
knowledge instead of selling it for 30 euros, wouldn't we?

How can neural networks be used to predict share prices? Intuitively, we
assume that future share prices are a function of the previous share
values.

But this assumption is wrong: Share prices are no function of their past
values, but a function of their assumed future value. We do not buy shares
because their values have been increased during the last days, but because
we believe that they will further increase tomorrow. If, as a consequence,
many people buy a share, they will boost the price. Therefore their
assumption was right - a self-fulfilling prophecy has been generated, a
phenomenon long known in economics.

The same applies the other way around: We sell shares because we believe
that tomorrow the prices will decrease. This will beat down the prices the
next day and generally even more the day after the next.

Appendix C

Excursus: reinforcement learning

What if there were no training samples but it would nevertheless be
possible to evaluate how well we have learned to solve a problem? Let us
examine a learning paradigm that is situated between supervised and
unsupervised learning.


I now want to introduce a more exotic approach of learning - just to leave
the usual paths. We know learning procedures in which the network is
exactly told what to do, i.e. we provide exemplary output values. We also
know learning procedures like those of the self-organizing maps, into
which only input values are entered.

Now we want to explore something in-between: The learning paradigm of
reinforcement learning - reinforcement learning according to Sutton and
Barto [SB98].

Reinforcement learning in itself is no neural network but only one of the
three learning paradigms already mentioned in chapter 4. In some sources
it is counted among the supervised learning procedures since a feedback is
given. Due to its very rudimentary feedback it is reasonable to separate
it from the supervised learning procedures - apart from the fact that
there are no training samples at all.

While it is generally known that procedures such as backpropagation cannot
work in the human brain itself, reinforcement learning is usually
considered as being biologically more motivated.

The term reinforcement learning comes from cognitive science and
psychology and it describes the learning system of carrot and stick, which
occurs everywhere in nature, i.e. learning by means of good or bad
experience, reward and punishment. But there is no learning aid that
exactly explains what we have to do: We only receive a total result for a
process (Did we win the game of chess or not? And how sure was this
victory?), but no results for the individual intermediate steps.

For example, if we ride our bike with worn tires and at a speed of exactly
21.5 km/h through a turn over some sand with a grain size of 0.1 mm, on the
average, then nobody could tell us exactly which handlebar angle we have
to adjust or, even worse, how strong the great number of muscle parts in
our arms or legs have to contract for this. Depending on whether we reach
the end of the curve unharmed or not, we soon have to face the learning
experience, a feedback or a reward, be it good or bad. Thus, the reward is
very simple - but on the other hand it is considerably easier to obtain.
If we now have tested different velocities and turning angles often enough
and received some rewards, we will get a feel for what works and what does
not. The aim of reinforcement learning is to maintain exactly this
feeling.

Another example for the quasi-impossibility to achieve a sort of cost or
utility function is a tennis player who tries to maximize his athletic
success on the long term by means of complex movements and ballistic
trajectories in the three-dimensional space including the wind direction,
the importance of the tournament, private factors and many more.

To get straight to the point: Since we receive only little feedback,
reinforcement learning often means trial and error - and therefore it is
very slow.


C.1 System structure

Now we want to briefly discuss different quantities and components of the
system. We will define them more precisely in the following sections.
Broadly speaking, reinforcement learning represents the mutual interaction
between an agent and an environmental system (fig. C.2).

The agent shall solve some problem. It could, for instance, be an
autonomous robot that shall avoid obstacles. The agent performs some
actions within the environment and in return receives a feedback from the
environment, which in the following is called reward. This cycle of action
and reward is characteristic for reinforcement learning. The agent
influences the system, the system provides a reward and then changes.

The reward is a real or discrete scalar which describes, as mentioned
above, how well we achieve our aim, but it does not give any guidance how
we can achieve it. The aim is always to make the sum of rewards as high as
possible on the long term.



C.1.1 The gridworld

As a learning example for reinforcement learning I would like to use the
so-called gridworld. We will see that its structure is very simple and
easy to figure out and therefore reinforcement learning is actually not
necessary. However, it is very suitable for representing the approach of
reinforcement learning. Now let us define, by way of example, the
individual components of the reinforcement system by means of the
gridworld. Later, each of these components will be examined more exactly.

Environment: The gridworld (fig. C.1) is a simple, discrete world in two
dimensions which in the following we want to use as environmental system.

Agent: As an agent we use a simple robot being situated in our gridworld.

State space: As we can see, our gridworld has 5 × 7 fields with 6 fields
being inaccessible. Therefore, our agent can occupy 29 positions in the
gridworld. These positions are regarded as states for the agent.

Action space: The actions are still missing. We simply define that the
robot could move one field up or down, to the right or to the left (as
long as there is no obstacle or the edge of our gridworld).

Task: Our agent's task is to leave the gridworld. The exit is located on
the right of the light-colored field.

Non-determinism: The two obstacles can be connected by a "door". When the
door is closed (lower part of the illustration), the corresponding field
is inaccessible. The position of the door cannot change during a cycle but
only between the cycles.

We now have created a small world that will accompany us through the
following learning strategies and illustrate them.
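The components listed above translate almost directly into a small data structure. The following sketch is only an illustration: the exact positions of the obstacles and of the door are shown in the figure and not stated in the text, so the coordinates used here are a hypothetical layout that merely reproduces the counts (5 × 7 fields, 6 inaccessible fields, 29 states). The names `states` and `step` are likewise assumptions.

```python
WIDTH, HEIGHT = 7, 5

# Hypothetical layout: two obstacle groups of three fields each.
OBSTACLES = {(2, 0), (3, 0), (3, 1), (3, 3), (3, 4), (2, 4)}
# Hypothetical door field that connects the two obstacle groups when closed.
DOOR = (3, 2)

# Action space: one field up, down, left or right.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def states(door_open):
    """State space: all accessible positions for the given door position."""
    blocked = OBSTACLES if door_open else OBSTACLES | {DOOR}
    all_fields = {(x, y) for x in range(WIDTH) for y in range(HEIGHT)}
    return all_fields - blocked


def step(state, action, door_open):
    """Apply an action; moves into obstacles or off the grid leave the state unchanged."""
    dx, dy = ACTIONS[action]
    target = (state[0] + dx, state[1] + dy)
    return target if target in states(door_open) else state


print(len(states(door_open=True)))            # prints 29
print(step((2, 2), "right", door_open=True))   # prints (3, 2)
print(step((2, 2), "right", door_open=False))  # prints (2, 2)
```

The argument `door_open` is meant to be fixed for a whole cycle and changed only between cycles, as required in the description of the non-determinism.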

C.l.2 Agent und environment 

Our aim is that the agent learns what happens by means of the reward.
Thus, it is trained over, of and by means of a dynamic system, the
environment, in order to reach an aim. But what does learning mean in
this context?

Figure C.1: A graphical representation of our gridworld. Dark-colored
cells are obstacles and therefore inaccessible. The exit is located on
the right side of the light-colored field. The symbol x marks the
starting position of our agent. In the upper part of our figure the
door is open, in the lower part it is closed.

Figure C.2: The agent performs some actions within the environment and
in return receives a reward.

The agent shall learn a mapping of sit¬ 
uations to actions (called policy ), i.e. it 
shall learn what to do in which situation 
to achieve a certain (given) aim. The aim 
is simply shown to the agent by giving an 
award for the achievement. 

Such an award must not be mistaken for the reward: on the agent’s way
to the solution it may sometimes be useful to accept a smaller reward
or even a punishment if the long-term result is maximized in return
(similar to an investor who sits out a downturn of the share price, or
to a pawn sacrifice in a chess game). So, if the agent is heading in
the right direction towards the target, it receives a positive reward,
and if not it receives no reward at all or even a negative reward
(punishment). The award is, so to speak, the final sum of all rewards,
which is also called return.

After having colloquially named all the ba¬ 
sic components, we want to discuss more 
precisely which components can be used to 
make up our abstract reinforcement learn¬ 
ing system. 

In the gridworld: In the gridworld, the agent is a simple robot that
should find the exit of the gridworld. The environment is the
gridworld itself, which is a discrete world.

Definition C.1 (Agent). In reinforcement learning the agent can be
formally described as a mapping of the situation space $S$ into the
action space $A(s_t)$. The meaning of situations $s_t$ will be defined
later and should only indicate that the action space depends on the
current situation.

$$\text{Agent}: S \to A(s_t) \tag{C.1}$$

Definition C.2 (Environment). The environment represents a stochastic
mapping of an action $A$ in the current situation $s_t$ to a reward
$r_t$ and a new situation $s_{t+1}$.

$$\text{Environment}: S \times A \to P(S \times r_t) \tag{C.2}$$
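To make the two definitions more tangible, here is a minimal sketch in which both mappings are ordinary functions. The toy world (five positions on a line, exit at position 4), the failure probability of 0.1 and the function names are assumptions made for illustration only; they do not come from the text.

```python
import random


def environment(situation, action, rng):
    """Stochastic mapping S x A -> P(S x r): sample a reward and a new situation.

    Hypothetical toy world: positions 0..4 on a line, the exit is position 4.
    With probability 0.1 the move fails and the agent stays where it is.
    """
    if rng.random() < 0.1:
        new_situation = situation
    else:
        new_situation = min(4, max(0, situation + action))
    reward = -1  # every time step is punished
    return reward, new_situation


def agent(situation):
    """Mapping S -> A(s_t): a fixed rule that always moves one field to the right."""
    return +1


rng = random.Random(0)
situation, total_reward = 0, 0
while situation != 4:
    reward, situation = environment(situation, agent(situation), rng)
    total_reward += reward
print(total_reward)  # never better than -4, because at least four moves are needed
```

The agent here is not learning anything yet; the sketch only shows who maps what to what.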

C.1.3 States, situations and actions

As already mentioned, an agent can be in 
different states: In case of the gridworld, 
for example, it can be in different positions 
(here we get a two-dimensional state vec¬ 
tor). 

It is not always possible for an agent to perceive all information
about its current state, so we have to introduce the term situation.
A situation is a state from the agent’s point of view, i.e. only a
more or less precise approximation of a state.

Therefore, situations generally do not allow us to clearly "predict"
successor situations; even in a completely deterministic system this
may not be possible. If we knew all states and the transitions between
them exactly (thus, the complete system), it would be possible to plan
optimally and also easy to find an optimal policy (methods are
provided, for example, by dynamic programming).

Now we know that reinforcement learning is an interaction between the
agent and the system including actions $a_t$ and situations $s_t$. The
agent cannot determine by itself whether the current situation is good
or bad: This is exactly the reason why it receives the said reward
from the environment.

In the gridworld: States are positions 
where the agent can be situated. Sim¬ 
ply said, the situations equal the states 
in the gridworld. Possible actions would 
be to move towards north, south, east or 
west. 

Situation and action can be vectorial, the 
reward is always a scalar (in an extreme 
case even only a binary value) since the 
aim of reinforcement learning is to get 
along with little feedback. A complex vec¬ 
torial reward would equal a real teaching 
input. 

By the way, the cost function should be 
minimized, which would not be possible, 
however, with a vectorial reward since we 
do not have any intuitive order relations 
in multi-dimensional space, i.e. we do not 
directly know what is better or worse. 

Definition C.3 (State). Within its environment the agent is in a
state. States contain all information about the agent within the
environmental system. Thus, given this godlike state knowledge, it is
theoretically possible to clearly predict the successor state of a
performed action within a deterministic system.


Definition C.4 (Situation). Situations $s_t$ (here at time $t$) of a
situation space $S$ are the agent’s limited, approximate knowledge
about its state. This approximation (about which the agent cannot even
know how good it is) makes clear predictions impossible.

Definition C.5 (Action). Actions $a_t$ can be performed by the agent
(whereupon it could be possible that depending on the situation
another action space $A(S)$ exists). They cause state transitions and
therefore a new situation from the agent’s point of view.

C.1.4 Reward and return

As in real life it is our aim to receive an award that is as high as
possible, i.e. to maximize the sum of the expected rewards $r$, called
return $R$, in the long term. For finitely many time steps¹ the
rewards can simply be added:

$$R_t = r_{t+1} + r_{t+2} + \ldots \tag{C.3}$$
$$= \sum_{x=1}^{\infty} r_{t+x} \tag{C.4}$$

Certainly, the return is only estimated 
here (if we knew all rewards and therefore 
the return completely, it would no longer 
be necessary to learn). 

Definition C.6 (Reward). A reward $r_t$ is a scalar, real or discrete
(sometimes even only binary) reward or punishment which the
environmental system returns to the agent as reaction to an action.

Definition C.7 (Return). The return $R_t$ is the accumulation of all
rewards received from time $t$ onwards.

¹ In practice, only finitely many time steps will be possible, even
though the formulas are stated with an infinite sum in the first place.


C.1.4.1 Dealing with long periods of time

However, not every problem has an explicit target and therefore a
finite sum (e.g. our agent can be a robot having the task to drive
around again and again and to avoid obstacles). In order not to
receive a diverging sum in case of an infinite series of reward
estimations, a weakening factor $0 < \gamma < 1$ is used, which
weakens the influence of future rewards. This is not only useful if
there exists no target but also if the target is very far away:

$$R_t = r_{t+1} + \gamma^1 r_{t+2} + \gamma^2 r_{t+3} + \ldots \tag{C.5}$$
$$= \sum_{x=1}^{\infty} \gamma^{x-1} r_{t+x} \tag{C.6}$$

The farther away a reward is, the smaller is the influence it has on
the agent’s decisions.
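A short worked bound shows why the weakening factor prevents divergence. It assumes that every reward is bounded in magnitude by some constant $r_{\max}$; this assumption is not stated in the text, but it is satisfied by rewards taken from $\{-1, 0, 1\}$.

```latex
\[
|R_t| \;\le\; \sum_{x=1}^{\infty} \gamma^{x-1}\,|r_{t+x}|
      \;\le\; r_{\max} \sum_{x=1}^{\infty} \gamma^{x-1}
      \;=\; \frac{r_{\max}}{1-\gamma},
\qquad 0 < \gamma < 1 .
\]
% Example: r_{t+x} = -1 for all x and \gamma = 0.9 gives R_t = -1/(1-0.9) = -10.
```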

Another possibility to handle the return sum would be a limited time
horizon $\tau$ so that only the $\tau$ following rewards
$r_{t+1}, \ldots, r_{t+\tau}$ are regarded:

$$R_t = r_{t+1} + \ldots + \gamma^{\tau-1} r_{t+\tau} \tag{C.7}$$
$$= \sum_{x=1}^{\tau} \gamma^{x-1} r_{t+x} \tag{C.8}$$


Thus, we divide the timeline into 
episodes. Usually, one of the two meth¬ 
ods is used to limit the sum, if not both 
methods together. 
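The sums (C.3) to (C.8) differ only in the weakening factor and in the number of rewards taken into account, so one small function covers all of them. This is a sketch; the function name `discounted_return` and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma=1.0, horizon=None):
    """Compute R_t = sum over x of gamma^(x-1) * r_{t+x}.

    rewards: list [r_{t+1}, r_{t+2}, ...] of future rewards
    gamma:   weakening factor (1.0 means plain summation)
    horizon: optional limit tau on the number of rewards considered
    """
    if horizon is not None:
        rewards = rewards[:horizon]
    # enumerate starts at 0, which matches the exponent x - 1 for x = 1, 2, ...
    return sum(gamma ** x * r for x, r in enumerate(rewards))


rewards = [-1, -1, -1, 10]
print(discounted_return(rewards))                        # 7.0
print(discounted_return(rewards, gamma=0.5))             # -0.5
print(discounted_return(rewards, gamma=0.5, horizon=2))  # -1.5
```

The second call shows how strongly a small weakening factor devalues the late reward of 10: it contributes only 0.125 * 10 = 1.25.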

As in daily life, we try to bring our current situation closer to a
desired state. Since it is not necessarily the next expected reward
but the expected total sum that decides what the agent will do, it is
also possible to perform actions that result in a negative reward in
the short term (e.g. the pawn sacrifice in a chess game) but will pay
off later.

C.1.5 The policy 

After having considered and formalized 
some system components of reinforcement 
learning the actual aim is still to be dis¬ 
cussed: 

During reinforcement learning the agent learns a policy

$$\Pi: S \to P(A).$$

Thus, it continuously adjusts a mapping of the situations to the
probabilities $P(A)$ with which any action $A$ is performed in any
situation $S$. A policy can be defined as a strategy to select actions
that would maximize the reward in the long term.

In the gridworld: In the gridworld the pol¬ 
icy is the strategy according to which the 
agent tries to exit the gridworld. 

Definition C.8 (Policy). The policy $\Pi$ is a mapping of situations
to probabilities to perform every action out of the action space. So
it can be formalized as

$$\Pi: S \to P(A). \tag{C.9}$$
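One simple way to represent such a mapping is a table that stores one probability per action for every situation. The sketch below assumes exactly that; the situation names, the probabilities and the helper `sample_action` are hypothetical.

```python
import random

# Hypothetical policy table: situation -> {action: probability}.
policy = {
    "start": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
    "corridor": {"E": 1.0},
}


def sample_action(policy, situation, rng=random):
    """Draw one action according to the probabilities the policy assigns."""
    actions = list(policy[situation])
    weights = [policy[situation][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(sample_action(policy, "corridor"))  # E
```

Learning then amounts to adjusting the numbers in this table, as long as every row still sums to 1.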

Basically, we distinguish between two policy paradigms: An open loop
policy represents an open control chain and creates out of an initial
situation $s_0$ a sequence of actions $a_0, a_1, \ldots$ with
$a_i \neq a_i(s_i)$; $i > 0$. Thus, in the beginning the agent
develops a plan and consecutively executes it to the end without
considering the intermediate situations (therefore
$a_i \neq a_i(s_i)$: actions after $a_0$ do not depend on the
situations).

In the gridworld: In the gridworld, an 
open-loop policy would provide a precise 
direction towards the exit, such as the way 
from the given starting position to (in ab¬ 
breviations of the directions) EEEEN. 

So an open-loop policy is a sequence of actions without interim
feedback. A sequence of actions is generated out of a starting
situation. If the system is known completely, such an open-loop policy
can be used successfully and lead to useful results. But to know the
game of chess completely, for example, it would be necessary to try
every possible move, which would be very time-consuming. Thus, for
such problems we have to find an alternative to the open-loop policy,
which incorporates the current situations into the action plan:

A closed loop policy is a closed loop, a function

$$\Pi: s_i \to a_i \quad \text{with} \quad a_i = a_i(s_i),$$

in a manner of speaking. Here, the environment influences our action
or the agent responds to the input of the environment, respectively,
as already illustrated in fig. C.2. A closed-loop policy, so to speak,
is a reactive plan to map current situations to actions to be
performed.


In the gridworld: A closed-loop policy would be responsive to the
current position and choose the direction of the next move
accordingly. In particular, when an obstacle appears dynamically, such
a policy is the better choice.
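The difference between the two paradigms can be tried out in a small sketch. The layout, the start and goal positions and the hand-written rule in `closed_loop_policy` are hypothetical (the original figure is not available here), and the sketch assumes that the situation seen by the closed-loop agent includes whether the door is open.

```python
LAYOUT = [
    ".......",
    ".###...",
    "...D...",   # 'D' is the door field
    ".###...",
    ".......",
]
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
START, GOAL = (2, 0), (2, 6)


def step(pos, action, door_open):
    """Move one field; stay in place if the target field is blocked."""
    r, c = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    if not (0 <= r < 5 and 0 <= c < 7):
        return pos
    cell = LAYOUT[r][c]
    if cell == "#" or (cell == "D" and not door_open):
        return pos
    return (r, c)


def run_open_loop(plan, door_open):
    """Execute a fixed action sequence without looking at the situations."""
    pos = START
    for action in plan:
        pos = step(pos, action, door_open)
    return pos


def closed_loop_policy(pos, door_open):
    """Reactive rule a_i = a_i(s_i): the action depends on the current situation."""
    r, c = pos
    toward_middle = "S" if r < 2 else "N"
    if r == 2:
        if door_open or c >= 4:
            return "E"
        return "W" if c > 0 else "N"      # door closed: walk back, then detour
    if c >= 4:
        return toward_middle
    if r in (0, 4):
        return "E"
    # remaining case: column 0 in row 1 or 3
    return toward_middle if door_open else ("N" if r == 1 else "S")


def run_closed_loop(door_open, max_steps=50):
    pos, steps = START, 0
    while pos != GOAL and steps < max_steps:
        pos = step(pos, closed_loop_policy(pos, door_open), door_open)
        steps += 1
    return pos, steps


print(run_open_loop("EEEEEE", door_open=True))    # (2, 6): the plan reaches the exit
print(run_open_loop("EEEEEE", door_open=False))   # (2, 2): stuck in front of the door
print(run_closed_loop(door_open=True))            # ((2, 6), 6)
print(run_closed_loop(door_open=False))           # ((2, 6), 10)
```

The fixed plan only works as long as the world behaves as expected, whereas the reactive rule takes the detour when the door is closed.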


When selecting the actions to be per¬ 
formed, again two basic strategies can be 
examined. 


C.1.5.1 Exploitation vs. exploration

As in real life, during reinforcement learning the question often
arises whether the existing knowledge should only be exploited or
whether new ways should also be explored. Initially, we want to
discuss the two extremes:

A greedy policy always chooses the way 
of the highest reward that can be deter¬ 
mined in advance, i.e. the way of the high¬ 
est known reward. This policy represents 
the exploitation approach and is very 
promising when the used system is already 
known. 

In contrast to the exploitation approach, it is the aim of the
exploration approach to explore a system in as much detail as possible
so that paths leading to the target can also be found which may not
look very promising at first glance but are in fact very successful.

Let us assume that we are looking for the way to a restaurant. A safe
policy would be to always take the way we already know, no matter how
suboptimal and long it may be, and not to try to explore better ways.
Another approach would be to explore shorter ways every now and then,
even at the risk of taking a long time and being unsuccessful, and
therefore finally having to take the original way and arriving too
late at the restaurant.

In reality, a combination of both methods is often applied: In the
beginning of the learning process the agent explores with a higher
probability, while towards the end more existing knowledge is
exploited. Here, a static probability distribution is also possible
and often applied.
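A common way to realize such a combination is the so-called epsilon-greedy selection, which is not named in the text and is used here only as an illustration: with probability epsilon a random action is explored, otherwise the best known action is exploited. The decay schedule and all parameter values below are assumptions.

```python
import random


def choose_action(action_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))
    return max(action_values, key=action_values.get)


def epsilon_schedule(episode, start=1.0, end=0.05, decay=0.99):
    """Exploration probability that shrinks as learning progresses."""
    return max(end, start * decay ** episode)


known_values = {"N": -3.0, "E": -1.0, "S": -5.0}
print(choose_action(known_values, epsilon=0.0))   # E (pure exploitation)
print(epsilon_schedule(0), epsilon_schedule(10000))  # 1.0 0.05
```

A static probability distribution, as mentioned above, corresponds to calling `choose_action` with a constant epsilon.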

In the gridworld: For finding the way in 
the gridworld, the restaurant example ap¬ 
plies equally. 


C.2 Learning process 

Let us again take a look at daily life. Ac¬ 
tions can lead us from one situation into 
different subsituations, from each subsit¬ 
uation into further sub-subsituations. In 
a sense, we get a situation tree where 
links between the nodes must be consid¬ 
ered (often there are several ways to reach 
a situation - so the tree could more accu¬ 
rately be referred to as a situation graph). 


The leaves of such a tree are the end situations of the system. The
exploration approach would search the tree as thoroughly as possible
and become acquainted with all leaves. The exploitation approach would
unerringly go to the best known leaf.

Analogous to the situation tree, we also 
can create an action tree. Here, the re¬ 
wards for the actions are within the nodes. 
Now we have to adapt from daily life how 
we learn exactly. 


C.2.1 Rewarding strategies 

An interesting and very important question is what a reward is awarded
for and what kind of reward it is, since the design of the reward
significantly controls system behavior. As we have seen above, there
generally are (again as in daily life) various actions that can be
performed in any situation. There are different strategies to evaluate
the selected situations and to learn which series of actions would
lead to the target. First of all, this principle should be explained
in the following.

We now want to indicate some extreme 
cases as design examples for the reward: 

A rewarding similar to the rewarding in a chess game is referred to as
pure delayed reward: We only receive the reward at the end of and not
during the game. This method is always advantageous when we finally
can say whether we were successful or not, but the interim steps do
not allow an estimation of our situation. If we win, then

$$r_t = 0 \quad \forall t < \tau \tag{C.10}$$

as well as $r_\tau = 1$. If we lose, then $r_\tau = -1$. With this
rewarding strategy a reward is only returned by the leaves of the
situation tree.

Pure negative reward: Here,

$$r_t = -1 \quad \forall t < \tau. \tag{C.11}$$

This system finds the most rapid way to reach the target because this
way is automatically the most favorable one in respect of the reward.
The agent receives punishment for anything it does - even if it does
nothing. As a result, reaching the target fast is the least expensive
option for the agent.

Another strategy is the avoidance strategy: Harmful situations are
avoided. Here,

$$r_t \in \{0, -1\}. \tag{C.12}$$

Most situations do not receive any reward, only a few of them receive
a negative reward. The agent will avoid getting too close to such
negative situations.

Warning: Rewarding strategies can have unexpected consequences. A
robot that is told "have it your own way but if you touch an obstacle
you will be punished" will simply stand still. If standing still is
also punished, it will drive in small circles. Reconsidering this, we
will understand that this behavior optimally fulfills the return of
the robot but unfortunately was not intended to do so.

Furthermore, we can show that especially small tasks can be solved
better by means of negative rewards while positive, more
differentiated rewards are useful for large, complex tasks.

For our gridworld we want to apply the pure negative reward strategy:
The robot shall find the exit as fast as possible.

C.2.2 The state-value function

Unlike our agent we have a godlike view of our gridworld so that we
can swiftly determine which robot starting position can provide which
optimal return.

In fig. C.3 these optimal returns are applied per field.

In the gridworld: The state-value function for our gridworld exactly
represents such a function per situation (= position) with the
difference being that here the function is unknown and has to be
learned.


Thus, we can see that it would be more practical for the robot to be
capable of evaluating the current and future situations. So let us
take a look at another system component of reinforcement learning: the
state-value function $V(s)$, which with regard to a policy $\Pi$ is
often called $V_\Pi(s)$, because whether a situation is bad often
depends on the general behavior $\Pi$ of the agent.

Figure C.3: Representation of each optimal return per field in our
gridworld by means of pure negative reward awarding, at the top with
an open and at the bottom with a closed door.

A situation being bad under a policy that is searching risks and
checking out limits would be, for instance, if an agent on a bicycle
turns a corner and the front wheel begins to slide out. And due to its
daredevil policy the agent would not brake in this situation. With a
risk-aware policy the same situation would look much better, thus it
would be evaluated higher by a good state-value function.

$V_\Pi(s)$ simply returns the value the current situation $s$ has for
the agent under policy $\Pi$. Abstractly speaking, according to the
above definitions, the value of the state-value function corresponds
to the return $R_t$ (the expected value) of a situation $s_t$.
$E_\Pi$ denotes the set of the expected returns under $\Pi$ and the
current situation $s_t$.

$$V_\Pi(s) = E_\Pi\{R_t \mid s = s_t\}$$

Definition C.9 (State-value function). The state-value function
$V_\Pi(s)$ has the task of determining the value of situations under a
policy, i.e. to answer the agent’s question of whether a situation $s$
is good or bad or how good or bad it is. For this purpose it returns
the expectation of the return under the situation:

$$V_\Pi(s) = E_\Pi\{R_t \mid s = s_t\} \tag{C.13}$$

The optimal state-value function is called $V_\Pi^*(s)$.

Unfortunately, unlike us our robot does not have a godlike view of its
environment. It does not have a table with optimal returns like the
one shown above to orient itself. The aim of reinforcement learning is
that the robot generates its state-value function bit by bit on the
basis of the returns of many trials and approximates the optimal
state-value function $V^*$ (if there is one).
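With a godlike view, a table of optimal returns per field can be computed directly, for instance by a breadth-first search that starts at the exit and subtracts 1 for every step (pure negative reward). The sketch below does this for a hypothetical layout; it is not the layout of the original figure, and the convention that the goal field itself has the value 0 is an assumption.

```python
from collections import deque

LAYOUT = [
    ".......",
    ".###...",
    "...D...",   # 'D' is the door field
    ".###...",
    ".......",
]
GOAL = (2, 6)


def optimal_returns(door_open):
    """Optimal return per field for a reward of -1 per step (breadth-first search)."""
    blocked = "#" if door_open else "#D"
    value = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < 5 and 0 <= nxt[1] < 7
            if inside and nxt not in value and LAYOUT[nxt[0]][nxt[1]] not in blocked:
                value[nxt] = value[(r, c)] - 1   # one more step, one more punishment
                queue.append(nxt)
    return value


for door_open in (True, False):
    table = optimal_returns(door_open)
    for r in range(5):
        print(" ".join(f"{table.get((r, c), '#'):>3}" for c in range(7)))
    print()
```

The agent cannot run this computation, because it requires complete knowledge of the layout and of the door; it has to approximate the same table from experience.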

In this context I want to introduce two 
terms closely related to the cycle between 
state-value function and policy: 

C.2.2.1 Policy evaluation 

Policy evaluation is the approach to try a policy a few times, to
obtain many rewards that way and to gradually accumulate a state-value
function by means of these rewards.

Figure C.4: The cycle of reinforcement learning which ideally leads to
optimal $\Pi^*$ and $V^*$.


C.2.2.2 Policy improvement 

Policy improvement means to improve a policy itself, i.e. to turn it
into a new and better one. In order to improve the policy we have to
aim at the return finally having a larger value than before, i.e.
until we have found a shorter way to the restaurant and have walked it
successfully.

The principle of reinforcement learning is to realize an interaction.
We try to evaluate how good a policy is in individual situations. The
changed state-value function provides information about the system
with which we again improve our policy. These two values lift each
other, which can mathematically be proved, so that the final result is
an optimal policy $\Pi^*$ and an optimal state-value function $V^*$
(fig. C.4). This cycle sounds simple but is very time-consuming.
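The cycle of evaluation and improvement can be sketched on a very small example. Note the assumptions: the world is a hypothetical corridor of five fields with the exit at field 4, the transitions are known to the program (so the evaluation is computed by repeated sweeps instead of being collected from trials), and a weakening factor of 0.9 is used so that the value of a bad policy stays finite.

```python
GOAL, GAMMA = 4, 0.9
STATES = range(5)
ACTIONS = (-1, +1)   # one field to the left or to the right


def move(s, a):
    """Deterministic transition; the corridor ends at fields 0 and 4."""
    return min(GOAL, max(0, s + a))


def evaluate(policy, sweeps=100):
    """Policy evaluation: estimate the state values by repeated sweeps."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s != GOAL:
                V[s] = -1 + GAMMA * V[move(s, policy[s])]
    return V


def improve(V):
    """Policy improvement: in every state pick the action with the best value."""
    return {s: max(ACTIONS, key=lambda a: -1 + GAMMA * V[move(s, a)]) for s in STATES}


policy = {s: -1 for s in STATES}   # poor initial policy: always walk left
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:       # nothing left to improve
        break
    policy = new_policy

print(policy)   # every field now moves right, towards the exit
print(V)
```

In each round the improved policy reaches the exit from one more field, which in turn changes the values the next evaluation produces.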

At first, let us regard a simple, random policy by which our robot
could slowly fill in and improve its state-value function without any
previous knowledge.


C.2.3 Monte Carlo method 

The easiest approach to accumulate a 
state-value function is mere trial and er¬ 
ror. Thus, we select a randomly behaving 
policy which does not consider the accumu¬ 
lated state-value function for its random 
decisions. It can be proved that at some 
point we will find the exit of our gridworld 
by chance. 

Inspired by random-based games of chance 
this approach is called Monte Carlo 
method. 

If we additionally assume a pure negative reward, it is obvious that
we can receive an optimum value of $-6$ for our starting field in the
state-value function. Depending on the random way the random policy
takes, values other (smaller) than $-6$ can occur for the starting
field. Intuitively, we want to memorize only the better value for one
state (i.e. one field). But here caution is advised: In this way, the
learning procedure would work only with deterministic systems. Our
door, which can be open or closed during a cycle, would produce
oscillations for all fields, and such oscillations would influence
their shortest way to the target.

With the Monte Carlo method we prefer to use the learning rule²

$$V(s_t)_{\text{new}} = V(s_t)_{\text{old}} + \alpha (R_t - V(s_t)_{\text{old}}),$$

in which the update of the state-value function is obviously
influenced by both the old state value and the received return
($\alpha$ is the learning rate). Thus, the agent gets some kind of
memory, new findings always change the situation value just a little
bit. An exemplary learning step is shown in fig. C.5.

In this example, the computation of the state value was applied for
only one single state (our initial state). It should be obvious that
it is possible (and often done) to train the values for the states
visited in-between (in case of the gridworld our ways to the target)
at the same time. The result of such a calculation related to our
example is illustrated in fig. C.6.

² The learning rule is, among others, derived by means of the Bellman
equation, but this derivation is not discussed in this chapter.


The Monte Carlo method seems to be 
suboptimal and usually it is significantly 
slower than the following methods of re¬ 
inforcement learning. But this method is 
the only one for which it can be mathemat¬ 
ically proved that it works and therefore 
it is very useful for theoretical considera¬ 
tions. 


Definition C.10 (Monte Carlo learning). Actions are randomly performed
regardless of the state-value function and in the long term an
expressive state-value function is accumulated by means of the
following learning rule.

$$V(s_t)_{\text{new}} = V(s_t)_{\text{old}} + \alpha (R_t - V(s_t)_{\text{old}})$$
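The learning rule can be tried out in a few lines. The sketch below uses a hypothetical corridor of five fields (exit at field 4) instead of the gridworld, a pure negative reward, and a random policy; the function names and the numbers in the two single-step examples are made up for illustration.

```python
import random

GOAL = 4  # hypothetical corridor: fields 0..4, the exit is field 4


def mc_update(v_old, ret, alpha):
    """Monte Carlo learning rule: V_new = V_old + alpha * (R_t - V_old)."""
    return v_old + alpha * (ret - v_old)


def random_episode(rng, start=0):
    """Follow a random policy until the exit is reached; return the visited fields."""
    s, visited = start, []
    while s != GOAL:
        visited.append(s)
        s = min(GOAL, max(0, s + rng.choice((-1, +1))))
    return visited


def monte_carlo(episodes=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(GOAL + 1)}
    for _ in range(episodes):
        visited = random_episode(rng)
        for i, s in enumerate(visited):
            ret = -(len(visited) - i)           # pure negative reward: -1 per remaining step
            V[s] = mc_update(V[s], ret, alpha)  # also trains the states visited in-between
    return V


print(mc_update(0.0, -6, alpha=0.5))    # -3.0
print(mc_update(-3.0, -10, alpha=0.5))  # -6.5
print(monte_carlo())
```

The two single-step calls show the memory effect described above: with a learning rate of 0.5 every new return moves the stored value halfway towards itself.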


Figure C.5: Application of the Monte Carlo learning rule with a
learning rate of $\alpha = 0.5$. Top: two exemplary ways the agent
randomly selects are applied (one with an open and one with a closed
door). Bottom: The result of the learning rule for the value of the
initial state considering both ways. Due to the fact that in the
course of time many different ways are walked given a random policy, a
very expressive state-value function is obtained.

Figure C.6: Extension of the learning example in fig. C.5 in which the
returns for intermediate states are also used to accumulate the
state-value function. Here, the low value on the door field can be
seen very well: If this state is possible, it must be very positive.
If the door is closed, this state is impossible.

C.2.4 Temporal difference learning

Most of the learning is the result of experiences; e.g. walking or
riding a bicycle without getting injured (or not), even mental skills
like mathematical problem solving benefit a lot from experience and
simple trial and error. Thus, we initialize our policy with arbitrary
values - we try, learn and improve the policy due to experience
(fig. C.7). In contrast to the Monte Carlo method we want to do this
in a more directed manner.

Figure C.7: We try different actions within the environment and as a
result we learn and improve the policy.

Just as we learn from experience to react to different situations in
different ways, temporal difference learning (abbreviated: TD
learning) does the same by training $V_\Pi(s)$ (i.e. the agent learns
to estimate which situations are worth a lot and which are not). Again
the current situation is identified with $s_t$, the following
situations with $s_{t+1}$ and so on. Thus, the learning formula for
the state-value function $V_\Pi(s_t)$ is

$$V(s_t)_{\text{new}} = V(s_t) + \underbrace{\alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t))}_{\text{change of previous value}}$$

We can see that the change in value of the current situation $s_t$,
which is proportional to the learning rate $\alpha$, is influenced by

▷ the received reward $r_{t+1}$,

▷ the previous return weighted with a factor $\gamma$ of the following
situation $V(s_{t+1})$,

▷ the previous value of the situation $V(s_t)$.

Definition C.11 (Temporal difference learning). Unlike the Monte Carlo
method, TD learning looks ahead by regarding the following situation
$s_{t+1}$. Thus, the learning rule is given by

$$V(s_t)_{\text{new}} = V(s_t) + \underbrace{\alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t))}_{\text{change of previous value}} \tag{C.14}$$
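The TD learning rule can be sketched in the same style as a program. Again a hypothetical corridor of five fields (exit at field 4), a pure negative reward and a random policy are assumed; unlike with the Monte Carlo method, the value is updated after every single step, using the stored value of the following situation instead of the complete return.

```python
import random

GOAL = 4  # hypothetical corridor: fields 0..4, the exit is field 4


def td_update(V, s, reward, s_next, alpha, gamma=1.0):
    """TD rule: V(s_t) += alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))."""
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])


def td_learning(episodes=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(GOAL + 1)}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            s_next = min(GOAL, max(0, s + rng.choice((-1, +1))))  # random policy
            td_update(V, s, -1, s_next, alpha)   # learn after every single step
            s = s_next
    return V


V = {0: 0.0, 1: -2.0}
td_update(V, 0, reward=-1, s_next=1, alpha=0.5)
print(V[0])  # -1.5
print(td_learning())
```

In the single-step example the target is $-1 + 1.0 \cdot (-2.0) = -3$, and with a learning rate of 0.5 the old value 0 moves halfway towards it.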


C.2.5 The action-value function

Analogous to the state-value function $V_\Pi(s)$, the action-value
function $Q_\Pi(s, a)$ is another system component of reinforcement
learning, which evaluates a certain action $a$ under a certain
situation $s$ and the policy $\Pi$.

In the gridworld: In the gridworld, the action-value function tells us
how good it is to move from a certain field into a certain direction
(fig. C.8).

Figure C.8: Exemplary values of an action-value function for the
position x. Moving right, one remains on the fastest way towards the
target, moving up is still a quite fast way, moving down is not a good
way at all (provided that the door is open for all cases).

Definition C.12 (Action-value function). Like the state-value
function, the action-value function $Q_\Pi(s_t, a)$ evaluates certain
actions on the basis of certain situations under a policy. The optimal
action-value function is called $Q_\Pi^*(s_t, a)$.

As shown in fig. C.9 the actions are performed until a target
situation (here referred to as $s_\tau$) is achieved (if there exists
a target situation, otherwise the actions are simply performed again
and again).

C.2.6 Q learning

This implies $Q_\Pi(s, a)$ as learning formula for the action-value
function, and - analogously to TD learning - its application is called
Q learning:


$$Q(s_t, a)_{\text{new}} = Q(s_t, a) + \underbrace{\alpha (r_{t+1} + \gamma \underbrace{\max_a Q(s_{t+1}, a)}_{\text{greedy strategy}} - Q(s_t, a))}_{\text{change of previous value}}.$$

Again we break down the change of the current action value
(proportional to the learning rate $\alpha$) under the current
situation. It is influenced by

▷ the received reward $r_{t+1}$,

▷ the maximum action value over the following actions, weighted with
$\gamma$ (Here, a greedy strategy is applied since it can be assumed
that the best known action is selected. With TD learning, on the other
hand, we do not mind to always get into the best known next
situation.),

▷ the previous value of the action under our situation $s_t$, known as
$Q(s_t, a)$ (remember that this is also weighted by means of
$\alpha$).

Usually, the action-value function learns considerably faster than the
state-value function. But we must not disregard that reinforcement
learning is generally quite slow: The system has to find out itself
what is good. But the advantage of Q learning is: $\Pi$ can be
initialized arbitrarily, and by means of Q learning the result is
always $Q^*$.

Figure C.9: Actions are performed until the desired target situation
is achieved. Attention should be paid to numbering: Rewards are
numbered beginning with 1, actions and situations beginning with 0
(This has simply been adopted as a convention).

Definition C.13 (Q learning). Q learning trains the action-value
function by means of the learning rule

$$Q(s_t, a)_{\text{new}} = Q(s_t, a) + \alpha (r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a)) \tag{C.15}$$

and thus finds $Q^*$ in any case.
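A minimal tabular sketch of this learning rule follows. It assumes a hypothetical corridor of five fields (exit at field 4), a pure negative reward, deterministic moves, and an agent that behaves completely randomly while learning; the learning rate, the number of episodes and all names are illustrative choices.

```python
import random

GOAL, GAMMA, ALPHA = 4, 1.0, 0.5
ACTIONS = (-1, +1)   # one field to the left or to the right


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)                        # behave randomly (pure exploration)
            s_next = min(GOAL, max(0, s + a))
            best_next = max(Q[(s_next, b)] for b in ACTIONS)   # greedy look-ahead
            Q[(s, a)] += ALPHA * (-1 + GAMMA * best_next - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(greedy)                 # the learned greedy policy moves right on every field
print(round(Q[(0, +1)], 3))   # close to -4, the optimal return from field 0
```

Although the agent never follows a sensible policy during learning, the table ends up describing the best behavior, because the update always looks at the maximum over the following actions.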


C.3 Example applications 

C.3.1 TD gammon 

TD gammon is a very successful backgammon game based on TD learning
invented by Gerald Tesauro. The situation here is the current
configuration of the board. Anyone who has ever played backgammon
knows that the situation space is huge (approx. $10^{20}$ situations).
As a result, the state-value functions cannot be computed explicitly
(particularly in the late eighties when TD gammon was introduced). The
selected rewarding strategy was the pure delayed reward, i.e. the
system receives the reward not before the end of the game and at the
same time the reward is the return. Then the system was allowed to
practice by itself (initially against a backgammon program, then
against an instance of itself). The result was that it achieved the
highest ranking in a computer-backgammon league and strikingly
disproved the theory that a computer program is not capable of
mastering a task better than its programmer.

C.3.2 The car in the pit 

Let us take a look at a car parked on a one-dimensional road at the
bottom of a deep pit, unable to get over the slope on either side
straight away by means of its engine power in order to leave the pit.
Trivially, the executable actions here are the possibilities to drive
forwards and backwards. The intuitive solution we think of immediately
is to move backwards, to gain momentum at the opposite slope and
oscillate in this way several times to dash out of the pit.

The actions of a reinforcement learning 
system would be "full throttle forward", 
"full reverse" and "doing nothing". 

Here, "everything costs" would be a good 
choice for awarding the reward so that the 
system learns fast how to leave the pit and 
realizes that our problem cannot be solved 
by means of mere forward directed engine 
power. So the system will slowly build up 
the movement. 

The policy can no longer be stored as a 
table since the state space is hard to dis¬ 
cretize. As policy a function has to be 
generated. 

C.3.3 The pole balancer 

The pole balancer was developed by 
Barto, Sutton and Anderson. 

Consider a situation including a vehicle that is capable of moving
either to the right at full throttle or to the left at full throttle
(bang bang control). Only these two actions can be performed, standing
still is impossible. On top of this car an upright pole is hinged that
could tip over to both sides. The pole is built in such a way that it
always tips over to one side so it never stands still (let us assume
that the pole is rounded at the lower end).


The angle of the pole relative to the vertical line is referred to as
$\alpha$. Furthermore, the vehicle always has a fixed position $x$ in
our one-dimensional world and a velocity of $\dot{x}$. Our
one-dimensional world is limited, i.e. there are maximum values and
minimum values $x$ can adopt.

The aim of our system is to learn to steer 
the car in such a way that it can balance 
the pole, to prevent the pole from tipping 
over. This is achieved best by an avoid¬ 
ance strategy: As long as the pole is bal¬ 
anced the reward is 0. If the pole tips over, 
the reward is -1. 
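Such an avoidance reward is easy to write down. In the following sketch the thresholds `max_angle` and `max_position` are hypothetical values chosen for illustration (the text does not specify when the pole counts as tipped over or where the walls are), and touching a wall is treated as a failure, in line with the remark that the pole tips over in that case.

```python
def pole_reward(angle, position, max_angle=0.2, max_position=2.4):
    """Avoidance strategy: 0 while the pole is balanced, -1 once it has tipped over.

    angle:    pole angle relative to the vertical line (radians)
    position: vehicle position x; touching a wall counts as failure
    """
    tipped = abs(angle) > max_angle
    hit_wall = abs(position) >= max_position
    return -1 if (tipped or hit_wall) else 0


print(pole_reward(angle=0.05, position=0.3))   # 0: still balanced
print(pole_reward(angle=0.50, position=0.3))   # -1: the pole has tipped over
```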

Interestingly, the system is soon capable of keeping the pole balanced
by tilting it sufficiently fast and with small movements. In doing so,
the system mostly stays in the center of the space since this is
farthest from the walls, which it understands as negative (if it
touches the wall, the pole will tip over).


C.3.3.1 Swinging up an inverted 
pendulum 

More difficult for the system is the 
following initial situation: the pole 
initially hangs down, has to be swung up 
over the vehicle and finally has to be 
stabilized. In the literature this task is 
called swing up an inverted pendulum. 


C.4 Reinforcement learning in 
connection with neural 
networks 

Finally, the reader may ask why a text on 
"neural networks" includes a chapter about 
reinforcement learning. 

The answer is very simple. We have already 
been introduced to supervised and 
unsupervised learning procedures. Although 
we do not always have an omniscient teacher 
who makes supervised learning possible, 
this does not mean that we do not receive 
any feedback at all. There is often 
something in between, some kind of 
criticism or school mark. Problems like 
this can be solved by means of 
reinforcement learning. 

But not every problem is as easily solved 
as our gridworld: In our backgammon example 
we have approx. 10^20 situations and the 
situation tree has a large branching 
factor, let alone other games. Here, the 
state- and action-value functions can no 
longer be realized as tables, as they were 
in the gridworld. Thus, we have to find 
approximators for these functions. 

And which learning approximators for these 
reinforcement learning components come to 
mind immediately? Exactly: neural networks. 
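
To make this concrete, here is a minimal 
sketch of a small neural network that 
replaces the table of an action-value 
function. Everything in it is an assumption 
for illustration: the class TinyQNetwork, 
the function q_learning_step, the single 
tanh hidden layer, and the Q-learning-style 
target r + gamma * max Q(s', a'). It is 
written in plain Python to stay 
self-contained and is not tuned for any 
real task. 

```python
import math
import random


class TinyQNetwork:
    """One hidden tanh layer; one linear output per action."""

    def __init__(self, n_in, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)

        def w():
            return rng.uniform(-0.5, 0.5)

        self.w1 = [[w() for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[w() for _ in range(n_hidden)] for _ in range(n_actions)]
        self.b2 = [0.0] * n_actions

    def hidden(self, s):
        """Activations of the hidden layer for state vector s."""
        return [math.tanh(sum(wi * si for wi, si in zip(row, s)) + b)
                for row, b in zip(self.w1, self.b1)]

    def q_values(self, s):
        """Approximate action values Q(s, a) for all actions a."""
        h = self.hidden(s)
        return [sum(wi * hi for wi, hi in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]

    def update(self, s, a, target, lr=0.05):
        """One gradient step on 0.5 * (target - Q(s, a))**2."""
        h = self.hidden(s)
        q = sum(wi * hi for wi, hi in zip(self.w2[a], h)) + self.b2[a]
        err = target - q
        # Error signal of each hidden neuron (only output a is involved),
        # computed before the output weights are changed.
        delta_h = [err * self.w2[a][j] * (1.0 - h[j] ** 2)
                   for j in range(len(h))]
        for j in range(len(h)):
            self.w2[a][j] += lr * err * h[j]
            for i in range(len(s)):
                self.w1[j][i] += lr * delta_h[j] * s[i]
            self.b1[j] += lr * delta_h[j]
        self.b2[a] += lr * err
        return err


def q_learning_step(net, s, a, r, s_next, done, gamma=0.9, lr=0.05):
    """Move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    target = r if done else r + gamma * max(net.q_values(s_next))
    return net.update(s, a, target, lr)
```

A state is passed in as a list of numbers, 
for example net = TinyQNetwork(2, 4, 3) 
for a two-dimensional state and three 
actions; each call to q_learning_step then 
plays the role that overwriting one table 
entry played in the gridworld. 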

Exercises 

Exercise 19. A robot control system is to 
be trained by means of reinforcement 
learning to find a strategy for exiting a 
maze as fast as possible. 

▷ What could an appropriate state-value 
function look like? 

▷ How would you generate an appropriate 
reward? 

Assume that the robot is capable of 
avoiding obstacles and at any time knows 
its position (x, y) and orientation φ. 

Exercise 20. Describe the function of 
the two components ASE and ACE as 
they have been proposed by Barto, Sutton 
and Anderson to control the pole 
balancer. 

Bibliography: [BSA83]. 

Exercise 21. Indicate several "classical" 
problems of informatics which could be 
solved efficiently by means of 
reinforcement learning. Please give 
reasons for your answers. 
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